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ABSTRACT 


Parallel imp@ementations ofM@Baceebouricer Transtorms 
(FFTs) and other fast transforms are represented using 
factored, partitioned matrices. The factored matrix 
description of a distributed FFT is introduced using a 
decimation=in—-time @ DED) FET alcommenme Siata Dlsewtotm 
implementation on a distributed signal processor. 

The heart of the matrix representation of distributed 
fast transforms is the use of permutatons of an NxN identity 
matrix to describe the required inter-processor data 
transfers on the Butterfly Network. The properties of 
these "transfer matrices" and the resulting output ordering 
are discussed in detail. The factored-matrix representation 
is then used to show that the Fast Hartley Transform (FHT) 
and the Walsh-Hadamard Transform (WHT) are supported by 


the Butterfly Network. 
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I. INTRODUCTION 


The Fast Fourier Transform (FFT) has become popular 
Maleace at computesethe Discrete Fourier Transform (DFT ) 
much faster than direct evaluation of the DFT definition. 
While evaluation of the DFT from the definition requires 
nA complex multiplications, the FFT Veco deine Eo dice S 


the same result with the number of multiplications 


proportional to N log, N. The DFT can be expressed as a 


vector-matrix equation: 
nk 
X(k) = W cal} 5 n, k=O03...,N-1 Cia) 
where W = exp[-j2/{/N]. Che 


The Nxl column vectors x(n) and X(k) are the input sequence 
and its DFT, respectively. The record length, N, is assumed 
to be a power of 2. The NxN matrix wo represents the DFT 
operation on the input data sequence that yields the 
transformed output sequence. This thesis uses matrix 

beet Or@zation of the DEFT matrix to deseribe a parallel 
multi-processor implementation of the FFT. After a detailed 
investigation into the properties of the necessary inter- 
processor data transfers, the Fast Hartley Transform (FHT) 
and the Walsh-Hadamard Transform (WHT) are also investigated 


for compatibility with the parallel-processing network. 


A. BACKGROUND 

The Cooley-Tukey FFT algorithm can be conveniently 
viewed as a factorization of the DFT matrix which it 
implements. Many authors have used this approach to 
illustrate the computational savings afforded by the 
FFT [Refs. 1=5]2 Each factor sotemiemtactonred—matni1 
representation corresponds to one butterfly stage of the 
FFT. The product of matrix factors is equivalent to 
evaluating the DFT by its definition, though the rows or 
columns of the FFT matrix product are reordered in 
comparison to the DFT matrix. The row (column) reordering 
corresponds to permutation of the output (input) sequence 
ordering, respectively. The keys to the speed advantage 
of the FFT are that the matrix factors are sparse 
[Ref. 3: p. 328], and that the compilexewelzhtine factors 
are pervodic {Ref. 4: p. 358]. ~~ Phesemprovertices pepmae 
the calculation of the DFT coefficients to be carried out 
iteratively. 

A more recent approach to enhance signal processing 
of large data sets is parallel processing. This concept 
was used to develop the Distributed Signal Processor (DISP) 
at MIT Lincoln Laboratory [Ref. 6]. The DISP uses nodal 
processors interconnected by a Butterfly Network to perform 
real-time signal processing. The input data set is first 


divided equally among the processors. Then all processors 


simultaneously perform the same operations on different 
segments of data. Finally, all processors simultaneously 
exchange partially-processed data through the Butterfly 
Network. The last two steps are repeated until the task is 
completed. Since many signal processing applications involve 
convolution and FFT algorithms, the inter-processor network 
was selected for efficient handling of these algorithms. 
The Butterfly Network allows all nodal processors to 
simultaneously transfer the necessary data among themselves 
wWietmemt: contizet [Ref. 6: p. 1]. References 6-10 discuss 
in detail the architecture of the DISP and the properties 
of the Butterfly Network. It is assumed that an FFT 
algorithm is carried out in parallel by P active processors 
to compute the DFT of a set of N data points. It is 
qoesmmed that N and P are integer powers of 2, N>2P 

Mere: poo. Therefore, each processor cOofitains 

M=N/P points, where M is also a power of 2. 

A new algorithm for the computation of power spectra 
memthe Fast Hartley Transform (FHT) introduced by 
Bracewell. Requiring only real arithmetic computations, 
the FHT performs even faster than the FFT [Refs. 11-12]. 
The Walsh-Hadamard Transform (WHT), on the other hand, 
uses Walsh functions (square waves) as its basis 


functions. Since the Walsh functions are square waves, 


Sievetakcesoneonly two Values; namely +1 or -1 [Ref. 13: p. 225]. 


Both the FHT and WHT are studied for parallel implementation 


using the Butterfly Network. 


B... RESEARCH OBUECGI2 i. 

The primary objective of this research is to investigate 
the suitability of the Butterfly wletwork For muler—processen 
implementation of fast transform algorithms. This thesis 
develops a matrix representation of the multi-processor 
FFT (MPFFT) suitable for implementation on the Distributed 
Signal Processor. The MPFFT algorithm is based on a radix-2, 
in-place, decimation-in-time (DIT) algorithm with input in 
bit-reversed order {Ref. 6: p. 28]. It is felt that the 
matrix representation provides insight into the output 
reordering caused by the required transfers on the Butterfly 
Network. It will be shown that the transfer patterns can 
be represented by permutations of an NxN identity matrix. 
This makes the "transfer matrices" easy to generate, and 
leads to a description of the output reordering as 
a function of the number of Ne DR OECROO RE - After the 
transfer patterns and properties of the transfer matrices 
are examined in detail, the FHT and WHT algorithms are 
investigated for potential support by the Butterfly Network. 

To permit concentration on the matrix manipulations, all 
normalizing factors of (1/N) are neglected. With the DFT 
definition used here, a factor of (1/N) is required in the 


definition of the inverse DFT, Other fast transforms also 
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require normalizing factors in either the forward or 
inverse transform definitions. This factor, in either 
direction, can easily be inserted, as necessary, as a 
gain multiplying each output element. In the MPFFT matrix 
representations of Sectien II, the column indices are 
repeatedly referred to as time indices. While this 
Pemmrnology 1S correct for the first stage, it is simply 
a convenience when referring to the partially-processed 
data at the second and successive computation stages. This 
terminology is used because it is felt that it helps rather 
than hinders the discussion. 

The multi-processor FFT is presented in Section II as 
a partitioning of the single-processor DIT FFT. The 
required inter-procesor data transfers are described by 
matrices, and their properties are examined in detail. An 
aleerithm is presented for generating the transfer matrices 
as permutations of identity matrices. The output reordering 
caused by the transfers is then discussed, and an algorithm 
is developed for determining output ordering as a function 
of N and P. Alternative transfer ee are investigated, 
and the pattern chosen in reference 6 is justified. The 
decimation-in-frequency (DIF) FFT is examined, and a 
general matrix equation for the MPFFT is developed. 

Section III investigates several other fast transforms 


for compatibility with the Butterfly Network. The 


let 


relatively new Fast Hartley Transform (FHT) is partitioned 
for parallel processing using the Distributed Signal 
Processor. Finally, the Walsh and Hadamard orderings of 
the Walsh-Hadamard Transform (WT) are studied. Section 

[IV presents a unified approach to the algorithms for 
transfer matrix generation and corresponding output 
reordering, and also shows how the two separate algorithms 


from Section [LI can be combined into a single algorithn. 


iy 


JI. FFT MATRIX FACTORIZATIONS 


This section introduces shorthand notation to simplify 
the matrix manipulations to follow, and presents a repre- 
Semeetton of a dastributed FFT algorithm using partitioned, 
factored matrices. The matrix version of the distributed 
FFT will be shown to involve permutations of identity 
matrices to describe the necessary data transfers. These 
"transfer matrices" possess many interesting properties 
that will be discussed in detail. 

Many authors have used matrices to describe FFT 
algorithms [Ref. 1: pp. 148-149 and 2: pp. 63-69]. In fact, 
the initial derivation of the FFT depended on the fact that 


an NxN matrix, none of whose elements is zero, can be 


fetomed 1nto Sparse matrices with many zeros [Ref. 3: p. 297]. 


A Discrete Fourier Transform defined by 
Meieye—= 5 x(n) exp[-j(27/N)nk] k=0,1,...,N-1 (e2zsles) 
=(0 


can be represented as a vector matrix equation 


E 
X(k) = W" x(n), Ce» 


@iemwew Wo = exp(—j27/N) feand Eis a matrix of exponents. fhe 


input sequence is the vector 2 (Cy given by 
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x(n) = '| x( 0) Gl eee ZCN=1 ae (2) 


and the transformed output sequence X(k) is 


~e 


XC) = [XCO) ely eccrine ae (2.4) 


~~ 


The vectors x(n) and X(k) may represent sequences in either 


normal order, or in digit-reversed order. In several forms of 


the radix-2 FFT algorithms a normal-ordered input 

data sequence produces output in bit-reversed order. 
Similarly, a bit-reversed input data sequence yields a 
transformed output Sequence in normal order. 

The matrix ry is a two-dimensional array representing 
the transform operation. It has row indices k=0,1,...,N-l, 
and column indices n=0), 1,20 N—-loeeoimecewEne  Ewildale ster ores 
W = exp(-j27/N) is common to all the elements of the DFT 
matrix rs a matrix of exponents E can be written, whose 


elements are functions of k and n. For example, for N=8, 


(2.1) gives 


¥(0) = [WO woo ay mic wenn mre ea ene 
XC1) = (WO wow Ge we ave al mec aie ee) 
ss C2459) 
x2) = [wo wo wo WOow mice nant omeeenD 
x(7) = [wo wl w® we we we w? we] x(n) 
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ine metrise. form, 


(Ck) “Wx (n)= a WC” Ore 
oe ie oy! oe 
a i 
i Vi ei 
and e = k 2 3 4 5 6 
0 eo a. oc 
1 2 Z| ee = <6 
2 i ice Vom 4: 
3 6 1 4 7 2 C2 
4 Oa Ono 
5 2 Veer: 
6 in 9 Moe. 
7 oe CED, 





Note that each element in the matrix of exponents, ese ee 
product of the row label, k, and the column label, n, computed 
modulo N (i.e., k:n mod N = remainder of k-n/N). The compu- 
tation of the kn products mod N is due to the fact that as 


PSmperiodie with = period N, i.e., 


LS 


AMER ING SEMEL), ftls Mel 3 O,0i, oo (2.7a) 


The periodicity of whee is one of the keys to the FFT 
[Ref. 4: p. 358]. In general, for an N-point DFT, the E 


matrix of exponents is NxN, with entries: 


9] 

k Oy tan N-2 N-1 

0 0 0 

1 N-2 N-1 

2 N-4 N-2 

E = (2230) 

N-2 4 - 

N-1 2 l 





Aeon ORTHANDBNOTA GION 

One way to derive FFT algorithms 1S tOetaGeor the DET 
Ma@eri x Ww into a product of matrices. This method is also 
well-suited to the description of distributed FFT algorithms, 
and will be the method used here. 

Elliott and Rao have developed a shorthand notation that 
makes the matrix factorization easy to follow [Ref. 2: 


pp. 47-49]. To illustrate, consider the 4-point DFT defined 


Daye: 
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E 


X(k) = W x(n) k=0,1,2,3 

. a‘ (2.9) 
N=0" 1.5.2 wo 

imi Matrix form, 
X(0) fv oy x(0) 
X(1) wo wl oy? (al) 
= » (210) 
C2) we we ww mo?) 
1x(3) wo owe ow? nC 





This can be manipulated into a decomposition-in-time (DIT) 
E 
FFT by reordering the columns of W as follows: 





X(0) we wo UW?! OW x(0) ae x(0) 
X(1) wo owe wh oweP | x(2) == x(2) 
= > = e 
X(2) Ww WO Wl WA (1) =i x(1) 
Ix(3) Wo we Wo i «(3) =e x(3) 
CP AS 
where W = exp(-j27/4) = -j, and the n index values of the 


input sequence have been correspondingly reordered as well. 
Ie 
iitemintdtrcixew can be factored to yield 


Eo 
W = WwW - W a 
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where wt and Ww? will represent the first and second 
stages, respectively, of butterfly computations in the 


DIT FEI Seer 








0 1 0 2 
BE. = and Ey = 0 G22 13) 
0 3 0 
Then, 
we we 0 1 
E, 0 we sg we 1 0 
wee ; ; z Orie 
W 0 W ° ® il 
0 we Ww 1 0 
and 
we 0 1 8 
. we OO 0 1 6 
a 7 
a 0 wow? 0 1 
0 wo ow? One 





where the shorthand notation of a dot (no entry) in the 
BD matrices of exponents corresponds to an element value 
E 


of zero in the yo matrices. Performing the matrix 


multiplicat Homnor (2. bo pesy cumdc 
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[oro po OM OTe | 1,0 cowie wo | 
| | 
; yOrO yOH2 yltO ylt2] [0 42 yl “| 
yOTO yOtO 2tO 42t0) [0 0 2 yA 
OO rem S20 2 | 3 a 
(Das) 


BR 

Note that W~ is factored such that only one nonzero entry 
E 

per row of W : moemultaplied by onlgg one nonzero entry of 


E 
any given column of W a Thasseresult is true for all 


fmiemomed | Frl Matrices |Ref. 2: Dae48). Since W isma 

common factor, the exponents simply add (modulo N) when the 
Deatnpienmlltipnlication is performed, i.e., W 
Therefore, an easier way to carry out the matrix multiplica- 
Erontmot (2.i12) is by the addition of appropriate exponents 


fe loo bet the shorthand notation 


E=E,* E£ (2.16) 
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mean the matrix of exponents derived from the sum of 
appropriate elements of the factored matrices of exponents 


of (2.13), where the elements to be added are defined by the 





row-times-column rule of matrix multiplication. Thus, 
ae 
9) : 0 9) 
: O ; 2 
E=E,*E = 
=e oO . 2 0 
: 0 : 9) 
| 0+0 OE TOMNOE TO MNO ZO fo 600i 
O+O0 O+2 1+0 1+2 9) Z 1 
= = (2. 
O+O0 O0+0 2+0 2+0 9) 0) 2 
OF ORO 20 ee a 0 Z 3 
and the result is the same as the exponents of (2.15). It is 


usually much simpler to work with the matrices of exponents, 
than to actually perform the mare mu leepercacton. ofmeec 
all the necessary information is contained in the matrices 
of exponents, this section will deal mainly with this 
representation. 
B. FACTORED-MATRIX REPRESENTATIONS OE SINGER -PR@GrsSoUkert & 

(SPFFT) 

E 


Factoring the DFT matrix W of (2.2) and reordering the 


rows and/or columns leads to various FFT algorithms. The 
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factored FFT matrices represent individual stages of the 
Demetreular albsorithms. The concept of matrix factorization 
will be developed in this section to describe a 
sangle-processor, radix-2, DIT FFT. 

To en eee an N-point) FF P@CEommete al powem of 2), there 
are log.N "conputation Matrix staeitors,.each having 2 
nonzero elements per row. Each computation matrix represents 
the butterfly computations carried out at a given stage. 
The number of nonzero elements per row depends on the radix 
base of the algorithm. For example, a radix-3 FFT compu- 
tation matrix contains 3 nonzero elements per row, radix-4 
matrices have 4 nonzero elements per row, etc. A flow 
GuaeeramerOor amy G-point radix-2 DET FFT is shown: in Figure 2,1. 
The vector-matrix equation describing a single-processor 
impmeementation of the 8=point DIT FFT is: 


E Ey By & 
X(k) = W x(n) = W W W x(n) k—-O. i... aes 


G2 . 1hs>) 
where 3 butterfly (computation) stages are Bec ge ey a 
accommlish the) radix-2 FFT. The matrix factors wt; mae 
and = represent the computations performed at stages 


1, 2, and 3, respectively. Let Xirent) represent the output 


of the first stage of butterflies of Figure 2.1. Then, 


Za 


[tr'd:9° yoy] weidetq MOTq Ladd LIG 34Uto”d-g 


N 
(by Ns (dy By= (by Ny (d) B= (b) T+ (b)"x 
Z/N+4 ve Nu 
Rf 
(b) "yn (4) X= (d) EH (4)"X 


:st ATjJs19aqqng y 


(2)x 


& 


(€)x 


(g)x 


© 


D (1)*x 


(9)x 


© 


(Z)x 


@) 


D (h) x 


© 





Do (0) x 


Zia 


Popes Co) Wx (2) = Wx? CO) + wox'(1) 
gee aude ok (1) aux’ (0) + w'x°(1) 
ee WB) sere (2) + wx (3) 
| exes) sore) + w'x°c3) Co 
ee Xk 5) = ok Cae W865) 
ee ne 6 sy ee) = ow x95) 
co = hee = WoX°(7) =ewe x (6dea wex"(7) 
Mee kT) = Wee) + wx°(7) 


O 
where ate = exp|-j(27/r)(4)=-1, ae and X (n)=x(n) is the 
bit~reversed input sequence, i.e., the input to the first 


im inatrix form, 


butterfly stage. 


XCn) 


C2nee 0) 


js 


x“(0) 
em) 
re) 
3) 
K“(4) 
eats) 
x°(6) 


eis 


Similarly, 


the output of the second stage of butterfly 


computations is: 


Gc) a) Gn, 
Ce Cop 
KO) ee 
iC), ee) 
(Ge eon) 
res) 2 HOKE 
OCS) 
ees ae) 
Wa ands Wwe ean 

oe 

5 

we 9 
liny = Jo Ww? 


W a CON 
fe year) 
wex4¢o) 
wext (ay 
woxt (4) 
Wo kes 
Woe ee 


wext¢s) 


wx4(2) 
ae) 
wtx4(2) 
ES 
Wi Gn) 
ee aD 
wx? (6) 


oa 


lmemiatrix form, 
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fo 
0 WwW 
Be 
0 WwW 


(2245) 


x'(n) 


GZ Ze 


The output of the last stage is: 


armen Gna X04) = woxcgo) 4 w°’x*@s) 


ee be x- Gs) = womeeion + wix7(s) 


X°(2) = x*(2)4w2x7(6) = wox202) + w2x2(6) 


9 2 e y) 
¥7(3) = X7(3)+W°x7(7) = wOx2(3) + wex207) 





Rey =F COW k-(4) = WoX°(O) + W°X?(4) oat 
eee awe x-(5) = wox-cd) + wox7(5) 
perder u-%-(6) = wox7(2) + w°x7(6) 
Mee ene ew X-(7) = W°x7C3) + Ww’ x77) 
eee ey = Ww, wo = —<w-, and W’ = -w-. 
Again, in matrix forn, 
0 0 0 we 0 0 0 
Ce) =x3(N) aw" 3x2(.N) = voor oOo O WW oO oO 
LOMO O woo 
> Oo Wo OD i Te 
0 0 0 wo oO 0 {X*(n) 
we § OK et 
9 Wg Oa eee 
MO MO. 0 aw’ 
Ces) 


Compammame (2.20), (2.22), and (2.24) gives the overall matrix 


representation for the 8-point DIT FFT: 
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ae he Wee 


E 1, & 
K(k) = W°x?(n) = wow 2xt(ny = wow Mw PXP(n) (2.25) 


where X(k) is the normal-ordered output sequence, and x°(n) 
is the bit-reversed input sequence. The factored matrices 
of exponents corresponding to the stages of the DIT FFT of 


Figure 2.1 are: 





(2268) 
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Le) 
© 
i) 





3 |: 0 
E. = 4 0 4 (2.26) 
5 0 5) 
6 Q 6 
i 0 7 


Mier eomwonce sdsdin a dot or no entry in the exponent matrices 
| By, 

SOGpEeCsponads tO an entry of zero in the W matrices. 
The column indices are discussed in a subSequent paragraph. 
Performing the matrix multiplication using the shorthand 
notation described earlier gives the result shown in 
Preupes 2.2. Notice that the addition of exponents in (2.27) 
is carried out mod N (N=8). Also observe the sparseness of 
Enewmatrix factors Ba Eo. and E.. Baecheimatnrix tactor 
contains only 2N entries, and only the final overall matrix 
of exponents, E, contains numerical entries in all NxN=Nn 
elements. 

A few comments about the nature of the DIT FFT algorithm 
are in order. When an N-point DFT is divided into two DFT's 
of one-half the original size, the time sequence separation 


of the input of either of the smaller DFTs is increased by 2. 


This corresponds to dropping (decimating) alternating inputs 
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of the original DFT; hence the name decimation in time. The 
idea behind the FFT is thus to break the original N-point 
Sequence into two N/2-point sequences, the DFT's of which 

can be combined to give the DFT of the original N-point 
sequence. Next, each N/2-point sequence is decomposed 

into two N/4-point sequences, and so on, until 2-point 
sequences result. The DFT of a one-point sequence is simply 
the sample itself [Ref. 5: p. 49]. The input sequence 
indices for an N-point DFT are aeparated by 1 unit, based 

on a normalized analysis period, (record length), P = 1 
second. This yields transformed output sequence frequencies 
Separated by |! Hz [Ref. 2: p. 60]. Thus, the column labels, 
n, of the factored matrices of (2.26) are separated by larger 
HicCMetchts Ob time going From right-to-left in Figure 2.1. 
That is, the last stage, E., has column labels separated by 

1 urnait, E, has indices separated by N/4=2units, and the Ey 
column labels are separated by N/2=4 units. Performing the 
matrix multiplications of (2.27) results in a time sequence 
mixing operation, in which the time labels in the matrices 


of exponents add. The matrices E. Pi wwiave salple periods 


2 
O and 1 or O and 2 units, respectively (based on an analysis 
period of P = 1 second). These periods mix so that E. * E, 
has periods 0, 2, 1, and 3 seconds. Likewise, the time 
sequence periods O and 4 seconds in ER combine to yield 


periods 0, 4, 2, 6, 1, 5, 3, and 7 seconds in the overall 


Pacem ncoverail Vit PRI matrix of exponents, E in 


LS 


(2.27) is the same as the DFT matrix of exponents (2.7), 
except for a reordering of the columns. Therefore, the DIT 
FFT matrix can be viewed as a DFT matrix with reordered 
columns. 

C. FACTORED=MATRIX REPRESENTA OF iui i-PpRveHs ow. 

FFT (MPFFT) 

In this section the concept of matrix factorization will 
be extended to the description of distributed FFT algorithms, 
with consideration of the multi-processor FFT implemented on 
the Butterfly Network at M.I.T. Lincoln Laboratory [Ref. 6]. 

Distributed FFT algorithms can also be described in terms 
of row or column permutations of an overall DFT matrix, 
with the inclusion of matrices describing the transfer(s) 
of data among the several processors. Figure 2.1 can be 
partitioned to describe an 8~-point FFT implemented using 4 
processors. The result of this partitioning is shown in 
Figure 252 

Since the DIT FFT algorithm combines data separated by 
1,2,4,...,N/2 memory locations prueermesccane thmouwchmene 
butterfly stages of the algorithm, the required separation 
will eventually exceed the size (M=N/P) of the data block 
held within each processor. (N=number of input data points, 
P~number of active processors; both assumed to be a power of 
2.) At this point, and for each subSequent computation stage, 
an inter-processor data transfer must take place. Kirk and 


Verly introduced the terminology "pre-transfer"™ and 
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"nost-transfer" butterflies to describe these multi-processor 
operations [Ref. 7: pp. 5-6]. The butterfly computations 
that are accomplished prior to any data transfers are called 
pre-transfer butterflies. Those@which@arescarrved Out atten 
data transfers are called post-transfer butterflies. 

Partitioning the N data points equally among P processors 
leads to L¥*=log Jf (M=N/P) stages of pre-transfer butterflies, 
and logoP post-transfer butterfly stages. Since each post- 
transfer butterfly stage requires one data transfer, there 
are also logoP transfer stages. Note that the number of 
transfers depends only on the number of processors, and not 
on the input data record length. 

The pattern of transfers used in reference 7 consists 
of an exchange of the lower half of the data in one processor 
with the upper half of the data in another. (In this context, 
lower and upper refer to local memory addresses, and not to 
relative memory positions.) The destination processor in a 
given transfer is determined Dy the cube. Pune bloenm, which 


complements the ith bit of the source processor's binary 


address [Ref. 8: p. 224]. For example, consider processor 
Py in a four processor (P=4) network. The processor's binary 
address is OO. The cube, function is Ol, and the cube, 


function is 10. The destination processors in the first 


data transfer are determined using the cube, function, those 


0 


in the second transfer by the cube, function. In general, 


1 
the destination processors inv €ne Leh eganstetea he sd oremnlinned 


by the cube’; _1) function | Remmeo <p 4a 


a2 


To determine whether a processor transfers the upper or 
lower half of its data, processors are designated as 
"upper-half" or "lower-half" processors at each transfer 
eee At the ith transfer stage, the (i-1l)st bit of the 
processor's binary address is inspected. If this bit is 
zero, the processor is designated as an upper-half processor, 
and it transfers the upper half of its data at this stage. 
A processor whose (i-l)st bit is one transfers the lower 
half of its data at the ith transfer stage [Ref. 9: p. 74]. 
For example, in Figure 2.3, at the first transfer stage (i=l), 
the Oth bits of processors Po and P. are zero. Therefore, 
Po and Po are upper-half processors; similarly, P, and P 
are lower-half processors at this stage. 

Rrertranster pattern chosen in references 7 and 9 
is not unique, but leads to a regular structure that can be 
described analytically. When these data transfers are 
represented using matrices, the "transfer matrices" possess 
many interesting properties. As will be shown later, the 
data transfers can be described as permutations of an NxN 
identity matrix. Although single-processor algorithms 
generate output Sequences in normal order when the input 
data is in bit-reversed order, the distributed-processor 


algorithm produces data in what reference 7 refers to as 


"ynseudo-normal"™ order. 


518, 


The distributed implementation of an 8-point DIT FFT with 
bit-reversed input data shown in Figure 2.3 can be described 


by the vector~-matrix equation: 


E ee 
X(k) = W W W W W x(n) (2.260) 


where EL» E. and E. are matrices of exponents describing 

the butterfly computations at the first, second, and third 
Stages, respectively. Matrices of this type are referred 

EO ase COMDItALLONeMatmumces = Ty and T5 are matrices 
describing the first and second data transfer stages, and 

are called "transfer matrices". To illustrate the generation 
of the transfer matrices, consider the first data transfer 


Er nye Meh aL fe lye) ZA S) & 











x0 (0) 0 X0(0) 
x7 (0) 0 Xo (0) 
Xo (1) 0 x7 (0) 
a 1 ise a, 
X5 (0) 0 X,(0) 
x5 (0) 0 X5(1) 
x3 (1) 0 X,(0)} (2.29) 
X31) 0 x31) 
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where the superscripts indicate stage number (prime denoting 
a transfer stage), the subscripts denote the processor number, 
and the argument specifies the multi-processor memory 


Poeeteion. in shorthand notation, 


(24530) 





Since me = 1. 

Tsien etweouesMortnand notation £oO carry out the matrix 
multiplication of (2.28) gives the result shown in Figure2.4, 
Note the ordering of the output sequence X(k), indicated by 
the row label, k. One-half of the data in a given processor 
can be read out in natural order before it is necessary to 
move on to the next processor to read the next element of 
the output sequence. Two saaSOS through the processor 
network are thus necessary to read out the entire output 
sequence. This is always true, regardless of the length 


of the output sequence or number of processors employed. 
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8-Point/4-Processor DIT FFT(BR Input) 


Figure 2.4 
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The partitioned computation matrices EL are identical to 
the single-processor case for the stages prior to any data 
transfers (the pre-transfer butterflies). For the post- 
transfer butterfly stages, the rows of the computation 
matrices have been permuted vis-a-vis their single-processor 
commmerparts. (For example, compare the row ordering of 
Ee 


exponents within any given row are identical, though the 


of Figure 2.4 with E, of Figure 2.2.) Notice that the 


ordering of the rows is changed. This reordering is caused 
by the data transfers, and will be examined in detail in the 
next section. The entire purpose for data transfers is to 
get data points that need to be combined in subsequent 

Stage butterflies into the same processor. This is necessary 
when the memory separation between points to be combined in 
a butterfly exceeds that of the data stored in a given 
processor. The most economical way to affect the necessary 
transfers is to transfer only one-half the data at any given 
transfer stage. There are several other constraints on 
potential transfer patterns; these will also be discussed 

in the next section. Finally, observe once again that the 
exponent entries in any computation matrix, E> can be 
determined by the residue of kn modulo N, where k is the row 


label, and n is the column label. 


Sif 


D. PROPERTIES OF TRANSFER MATRICES 

At the heart of the multi-processor FFT (MPFFT) 
algorithm are the inter-processor data transfers that must 
occur. There are log.P inter-processor transfers required 
to compute an FFT implemented using P active processors. 
When the transfers are represented by matrices, the "transfer 
matrices" possess many interesting properties. These 
properties will be examined in detail in this section. The 
required data transfers permute the ordering of the output 
sequence, hence the rows of the FFT matrix. When the FFT 
is implemented on a single processor, the output is produced 
in normal order when the input data is in bit-reversed (BR) 
order. The bit-reversal is a consequence of the decimation- 
in-time FFT algorithm for computing DFTs. The multi- 
processor PFT algorithm produces "pseudo-normal" (PN) 
ordered output when the input sequence is in bit-reversed 
order. Thus, PN ordering is a characteristic of MPFFT 
implementations. This section will show how and why the PN 
ordering comes about. 

1. Transfer Matrix Patterns 

An eight-port butterfly network to interconnect 

eight processors is shown in Figure 2.5. Reference 7 
describes in detail the topology and properties of the 
Buttterfly Network (BN). The transfers that take place on 
the BN are included in the vector-matrix FFT representation 


by additional matrix factors fo be Called) transtcr snare uee—. = 
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The transfer matrices represent the input-output connectivity 
of the BN, where each switch is regarded as a two-port 
network. The inputs to the BN come from output ports of the 


processors Pos P Poi the network's outputs in turn 


pote 
are connected to the input ports of these same processors 
[Ref. 10: p. if. It ts¥eonvenienm too werew the tEnansten 
matrices as permutations of an NxN identity matrix. An 
identity transfer matrix results in effectively no transfer, 
leaving all the data in unchanged locations. When 
representing the actual inter-processor data transfers, 
elements move off the main diagonal ina very regular pattern. 
One-half of the elements remain on the main diagonal. (Recall 
that only half the data is transferred at any given transfer 
stage.) The remaining elements move right (or equivalently, 
down) off the main diagonal or left (up) off the main diagonal. 
The elements move so as to maintain symmetry about the main 
diagonal. This is a convenient property, the consequences 

of which will be discussed later. 

The only parameters that must be known to determine the 
exact form of a particular transfer stage matrix are the 
record length, N, the number of active processors, P, and the 
transfer stage number, R. The number of active processors 
sets the number of transfer stages, loemee Tre. ratio 
N/P=M (the size of the data block in each processor) determines 
the number of elements that remain on the main diagonal, M/2. 


To illustrate the.exact pattern of a transfer matrix, consider 
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a 16-point DIT FFT implemented using 4 active processors. 

The four-processor system requires log,P=2 transfer stages. 
The transfer matrices describing the inter-processor transfer 
are shown in Figure 2.6. The NxN matrices are partitioned 
into blocks of dimension MxM. The first MxM block corresponds 
to processor Pos the next block to Pas etc. Labels are 

added indicating processors corresponding to given rows and 
columns. The local processor memory addresses are listed 
next to the processor addresses. The MxM blocks corresponding 
to individual processors are highlighted. Numerical entries 
not falling into a highlighted processor block (along the 
diagonal) describe the inter-processor transfers. The 
destination processor is indicated by the element's row 
position and the source processor is given by the column 
position. "Upper-half" or “lower-half" processors are 
determined by which column(s) contain the entries that have 
moved outside the processor's block. If the right-most 
columns (larger local memory addresses) corresponding to a 
source processor contain the transferring entries, the 
processor is an "“upper-half" processor at that stage. 
Similarly, left-most column transfer entires correspond to 
"lower-half" processors. Recall the determination of 
"upper-half" and "lower-half" processors discussed in the 
previous section. Processors withthe same (i-1l)st bit of 
their binary addresses exhibit the same pattern at the ith 


transfer stage. In addition, processors that exchange data 
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Gpeeeomiume teansoter stage differ only in the (i-1)st bit of 
their binary addresses. To illustrate these ideas, consider 
paewtranster matrix T, of Figure 2.6. The entries in rows 4 
Gmees a SOCCuUr ime columns 2 and 3, respectively, outside the 

Gees bDiock corresponding to Po. hissed urans the first 
meister Stace, processor Po transfers the upper-half (memory 
ieeataons 2 and 3) of its data to Py (processor corresponding 
to rows where the entries appear). Recall that upper and 

lower refer to the local memory addresses, and not to relative 
Peeetton an memory. Note also that the Oth bit is the only 
bit where Py's and P,'s binary addresses differ. (Bit value=0 
Corresponds’ to upper-half processors, and a 1 indicates 
lower-half processors.) 

The distance moved off the main diagonal for those 
elements shifting is given by the difference between single- 
processor memory separation of points combining in butterflies 
and the separation available in the local processors. Points 


participating in a butterfly inthe single-processor DIT FFT 


are separated by 1,2,4,8,...,N/2 memory locations at successive 
stages. In general, the separation between points participating 
: ra ; 

jmea butterfly at stage Lis 2 . The maximum separation 


available in the MPFFT is M/2. Therefore, elements move 
(261-1), (M/2) places off the main diagonal when representing 
the transfer prior to the Lth butterfly stage. In the 16-point/ 


4-processor example, the first transfer occurs prior to the 


third butterfly stage. Thus, elements shifting off the main 
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(3-1) _ (4/2)) = 2 places right or left/down 


diagonal move (2 
or up. In summary, all the information about an 
inter-processor transfer stage can be determined by 
inspecting the corresponding transter matrix. 
2. Algorithm for Generating Transfer Matrices 

The regular patterns exhibited by the transfer 
matrices make them easy to generate. The row/column positions 
can be expressed as an ordered-pair (i,j). For an identity 
matrix (no transfer), all elements reside on the main 
diagonal and i=j. When actual inter-processor transfers are 
represented, the column index is permuted so that i#j. The 
algorithm for obtaining the column indices as permutations 
of the row indices is as follows. First, the rows and 
columns are numbered from 0 to N-l. Next, the row indices 
are represented as binary numbers. The cotumn indices of 
elements describing inter-processor transfers are obtained 
by switching 2 appropriate bits in the binary representation 
of the row indices. When computing.an N-point MPFFT, the 
matrices are of dimension NxN. Thus, there are log,N bits 
necessary to represent the row/column indices in binary. 
Label thesesbats by aby _yoeeesbys boy: where L=(log,N)-1. The 
bit whose label is (L*-1) is switched with the bit whose 
label depends on the transfer stage number. (L*=log.M=number 
of pre-transfer butterfly stages.) At the first transfer 
stage, for example, bit number (L*-1) is switched with the 


bit indexed one number greater. At the second transfer 
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stage, bit (L*-1) is switched with the bit whose label is two 
greater, etc. This procedure generates the column position, 
ieee tne mwordered—-pairw(i,j}) £rom the row position. For 
example, an 8-point DIT MPFFT implemented on 4 processors 
requires 3 bits to represent the row/column indices 
(eres. ,7. lhese bits are labelled bo,b,, 0° 
L*=log,M=log.,(N/P)=1. To determine the column positions 


b Then 

of the transfer matrix entries, the row indices are permuted 
as described above. That is, for the first transfer stage, 
bits by (L*-1) and b, are switched. At the second transfer 


i 


Stage, bits Do and bo are switched. Thus, the transfer 
matrices for this example are obtained a shown in Figure 2.7. 
Note that switching the appropriate two bits at a given 

stage has no effect on the elements that remain on the main 
diagonal. For elements not transferring, the two appropriate 
bits are the same for the row index at the stage of interest. 
Consider now the powers of 2 that the bits represent. The 
distance shifted off the main diagonal is the difference 
represented by the bit-switching operation. For example, 


a Ae 


when switching b, and b the difference is 2°-2°=1. At 


OQ 
this transfer stage, elements shifting move one place off 
the main diagonal. This is simply the difference between i 
and j when i#¥j. If this algorithm results in a larger 
column index ((i,j) j i), the element in the transfer 
matrix shifts right off the main diagonal. If the row 


index permutation results in i>j, the corresponding element 


shifts left. Right shifts off the main diagonal correspond 
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to ‘upper-half" processors, and left shifts correspond to 
“Yower-half" processors. The bit-switching algorithm holds 
for any number of points and any number of processors. [It 
always involves switching the bit labelled (L*-1) with a 
bit determined by the transfer stage number. Although this 
PeesOELehm works for either DIT or DIF MPFFTs with bit-reversed 
input data sequences, it must be slightly modified to 
accommodate pseudo-normal input. This modification will be 
developed in a later section. 
Pee orenoOgeonality of Transfer“Matrices 

All transfer matrices are symmetric. At each 
transfer stage, each processor retains half its data. The 
corresponding elements of the transfer matrix remain on the 
main diagonal. Each processor that transfers the upper-half 
of its data (transfer matrix elements shifting right) receives 
data from another processor's lower half (elements shifting 
left). At each stage, elements shifting off the main diagonal 
all move the same number of places. This results in a symmetric 
BebEmWUtation of the NxN identity matrix. 

Recall that entries in the transfer matrices only 
bake On numerical values of Gein) or O(.=no entry). This 
property, coupled with the symmetry property, results in 


all transfer matrices being orthogonal. That is, 


Re TR = 


= (W) R=1,2,...,log,P (622 32) 
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Equivalently, 








T 
Or, in thestsiortiandemicegetour 
Mee wl So deas (0. (Dees 
~R  =R sis ; 
For example, consider again tte of the 8-point.4-processor 
DIT MPFFT. Substituting into (Zee es): 
i iE 
wt ee = 
100000 0 0) 190 0 0 OF Cromoior oon o co cama 
O 0 1 Oetreo 00 O O08 0 OF ORORO leo 0 O-0me 
O- ESO 0 OOO 0 OF 1 O30H0 0 030 O OO 0 “OeGe0 
0:0: 0.) 2OR0e Or ® O00 0 50. 04086 OPOR Oe -0 Oat 0 
O20 OF0 L070 50'" |e 0 omomD Om TO 0 ti 0 0 a 
0000001 0 0000001 0 O00 0 0 1 Serre 
O00. 0 Oso OO 0 0 0 cies a0 OO -0r 0-0: 0-1 a6 
O«0 0 0 0 OO js .0 0 0° 0707" 0ies we O O° Ow uw 0est 
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These convenient properties will be put to use later. For 
now, it is sufficient to note that all transfer matrices. 
are orthogonal. 

4. Pseudo-Normal Ordering 

It was previously shown that single-processor FFT 
algorithms can be viewed as row or column permutations of a 
DFT matrix. This section will demonstrate that multi- 
processor FFT implementations can be viewed as further 
non colunnmepermitations of the same basic DFT matrix. The 
additional row/column permutations arise from the inter- 
processor data transfers required in the MPFFT algorithn. 
The inter-processor transfers also cause a reordering of the 
data sequences involved in the MPFFT. 

Inter-processor data transfers arise from the need 
to move partially-processed data points that will be 
combined in succeeding butterfly stages in the same processor. 
When computing a DIT MPFFT, this permutes the rows of 
succeeding computation matrix factors. It will be shown 
later that the inter-processor transfers similarly reorder 
the columns of the decimation-in-frequency (DIF) computation 
matrices. The exponent weighting factors within a given row 
or column remain unchanged; only the row or column ordering 
is changed. As would be expected, the row/column 
permutations also exhibit a regular pattern, based on the 


pattern of the transfer matrices. 
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The rows and columns of the matrix factors are 
labelled with values of k and n, respectively. The column 
labels, n, correspond to the ordering of the input data 
sequence, x(n). For now, only bit-reversed (BR) input 
sequences will be considered. The row labels, k, represent 
the ordering of the output sequence, X(k). Starting with 
the normal-ordered output that is eee bythe =Srknien 
algorithm using BR input, this section describes how the 
normal ordering is transformed into "pseudo-normal" (PN) 
ordering by the inter-procesosr transfers on the Butterfly 
Network. 

Consider a 1l16-point DIT MPFFT implemented on an 
8-processor Butterfly Network (BN). Three transfer stages 
(log.8) are necessary in this computation. The row labels, 
k, undergo the following stage-by-stage permutations: 


T i i 


1 2 3 
k 
0 0 0 0 
(Py) 1 2 4 8 
2 1 1 i 
Cs ee 3 5 9 
4 4 y) 2 
Sen 6 6 10 
6 5 3 3 
(oe 7 7 11 
8 8 8 4 
Copp mee 10 12 12 
10 9 9 5 
(P.) 11 11 ive 13 


ul 
© 


I ileZ 10 6 


(Pe) 13 14 14 14 
14 13 a 7 
Gaia t5 15 ns 15 


Figure 2.8 Normal to Pseudo-normal Ordering 
SON pOlLntLs/ Gmpmoecessors) 

Figure2.8 illustrates which data points are combined in the 
butterfly computations at each stage. Mirine tie werrst 
(and only) pre-transfer butterfly stage, points x(0) and 
woljmare combined in a butterfly imeprocessor Po: The second 
butterfly stage requires separation between points of g62-l) 5 
memory locations. This separation is not available within a 
given processor, so an inter-processor transfer must occur. 
After the first transfer, partially-processed data points 
x(0) and x(2) reside together in Pos and another butterfly 
computation is performed. This procedure continues until 
the computation is completed. The PN output sequence X(k) 
that results is in the order shown after the final ance 
Stage, T.- This pseudo-normal ordering depends only on the 
pecord lLemeth, N, number of active processorswy P, and their 
ratio, M=N/P. The record length determines the total 
number of butterfly stages, logoN. The number of transfer 
stages (logsP) is set by the number of processors. The size 
of the data block in each processor, M, sets the number of 
pre-transfer butterfly stages (log.M=L*) that can be computed 


prior to any transfers. The PN ordering is a function of the 


aul 


number of transfers, hence processors. This is analogous to 
the SPFFT case, where bit-reversal is dependent on the record 
length, N. Note also from Figure 2.8 that one-half the points 
in each processor are output in normal order, after the final 
transfer. Thus, the output Sequence can be rearranged Dy 
additional network transfers, or the processor local memories 
can be read cyclically to obtain the output data in normal 
orderellRet. 72? ape loure 

The reordering of the output sequence, from normal 
to PN ordering, can be derived by a bit switching algorithm 
Similar to that for generating transfer matrix patterns. 
The bits to be switched at each stage are different in this 
case, however. The binary representation of the row labels, 
k, again requires log.N bits to describe the N-point output 
sequence. The bits switched this time always involve two 
adjacent bits, beginning again with the bit labelled (L*-1) 
at the first transfer stage. The labels of the bits to be 
Switched are then each incremented by one at each successive 
transfer stage. Figure 2.9 illustrates the algorithm for 2 
different values of P and M. Note in Breure 2.95 that only 
values of k in each processor are changed at any given stage. 
Also observe that the starting bit to be switched at the 
first transfer stage is different for different values of M. 
This is because fewer active processors compute more pre- 
transfer butterfly stages, and require fewer transfers (less 


reordering). Consider again the powers of 2 represented by 
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Normal to Pseudo-Normal Ordering 
Pseudo-Normal Ordering 


points/4-processors DIT MPFFT (Lx-1) 


16- 


Figure 2.9b 
Figure 2.9 


the bts. Since the 


bits Dy and D> 


the k values change by 2 


first transfer in Figure 2.9a switches 


1402). This 


explains why the bits to be switched are adjacent in this 


case. During each transfer, the network is preparing for 


butterflies requiring separation between points twice that 


available. Thus, bits representing successively higher 


powers of 2 are switched at succeeding stages. 


5. Selectionmet 


The transfer 


Transter Matmewecs 


patterns used in references 7 and 9 


are not unique. Other patterns of transfers were investigated, 


but had the disadvantage of structures for which the general 


pattern could not be 
[Ref. 9: p. 74]. As 
exists regarding the 


tributed FEFIs on the 


easily discerned and described analytically 
would be expected, the same situation 
transfer matrix factors describing dis- 


BN. This section discusses the 


selection of appropriate transfer matrices to represent the 


inter-processor transfers on the BN. 


Several constraints on the make-up of potential 


transfer Matrices cam Deemuenta tao am These constraints are: 


1) Entries take@on valtes vot pions moa 


2) Only one entry per row or column is permitted. 


3) The toplogy of 


4) Data points to 
must reside in 


the Butterfly Network must be satisfied. 


be combined in the next butterfly stage 
the same processor after transer. 


5) One-half the entries remain on the main diagonal. 
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These constraints greatly reduce the number of NxN matrices 
that qualify as transfer matrices. The constraints also 
lead to the transfer matrices being symmetric, hence 
orthogonal. This convenient property will be exploited in 
the next section. 

Beginning with the NxN identity matrix, the elements 
are permuted according tothe constraints listed above to 
determine "legal" transfer patterns. For the 8-point/ 
4—processor DIT MPFFT, this procedure yields only two 
matrices for each transfer stage that meet all the constraints. 
These matrices are shown in Figure 2.10. The transfer matrices 
actually used are shown in Figure 2.10a. The other matrices 
Smee satisty all the constraints are shown in Figure 2.10b. 
When the transfer patterns of Figure 2.10b are used, two 
undesirable properties result. First, since the higher- 
numbered data points occupy the lower addresses in each 
processor memory, the butterflies that follow must be 
inverted, with the twiddle factor paterns upside-down. 
Additionally, following the second transfer, the weighting 
factor patern is scrambled from the order expected. The 
second undesirable result involves the ordering of the 
Output sequence. Instead of normal, bit-reversed, or 
pseudo-normal ordering, all of which can be unravelled, the 
output is in indecipherable order: X(5) X(1) X(4) X(€0) X(7) 


x¥(3) x(6) x(2). In addition to meeting all the constraints, 
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the transfer patterns used in reference 7 also lead to 

a consistent pattern in the twiddle factors in every 
butterfly stage. The PN ordering of the output sequence 

is easy to describe and unravel. Note also that the 
alternate transfer matrices are more difficult to generate. 
That is, the bit-switching algorithm presented in Section 


Mane toe cumo@mm=vork fOr the alternate transfer matrices. 


Eee GENBRALEZED MULTI-PROCESSOR FFTs 

Single-processor FFT (SPFFT) algorithms permute the 
row or column ordering of the DFT matrix which they imple- 
ment. This is turn reorders the output (permuted rows) or 
input (permuted columns) data sequence. The reordering 
caused by SFFT algorithms changes normal-ordered input or 
output sequences into bit-reversed (BR) order. The multi- 
processor implementation of the decimation-in-time (DIT) 
FFT algorithm described in Section II.D.4 has been shown 
to further reorder the rows ofthe basic DFT matrix. This 
further permutes the ordering ofthe transformed output 
sequence. As waS shown, when using bit-reversed input data, 
the DIT multi-processor FFT produces a pseudo-normal (PN) 
ordered output sequence. Pseudo-normal ordering is thus a 
consequence of this multi-processor FFT (MPFFT) algorithn, 
as bit-reversed ordering is associated with SPFFT 
algorithms. This section describes the generalized 


factorization of MPFFT matrices to obtainthe stage-by-stage 
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matrix factors. The DIT MPFFT (BR input/PN output) examined 
previously will be extended to represent the DIT MPFFT 
with PN input ordering, and to decimation-in-frequency (DIF) 
MPFFT algorithms using both BR- and PN-ordered input 
sequences. 

The overall MPFFT matrix of twiddle factor exponents, 
E, is obtained by forming the modulo N product of the 
appropriately ordered row and column labels [Ref. 2: p. 43]. 
The row and column labels represent the ordering of the 
output and input sequences, respectively. A BR input 
sequence to the MPFFT implemented using the Butterfly 
Network produces PN-ordered output. Similarly, a PN input 
data sequence produces transformed output in BR order. This 
is true whether a DIT or DIF MPFFT is being computed, 
although the matrix factors representing individual computa- 
tion stages differ. Thus, after calculating the BR and PN 
sequence ordering for the given number of data points and 
active processors, Theis easy Pomme domme Ve sO ve ima cr 
MP F Ptah 

Once the overall MPFFT matrix is obtained, it must 
be factored to represent the individual computation stages. 
As discussed previously, an N-point FFT requires a total 
Of logoN butterfly stages. When implemented on the multi- 
processor Butterfly Network, there are L*=logoM pre-transfer 
butterfly stages. (Recall that N=record length, P=number 
of active processors, and M=N/P=number of points in each 
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processor.) Therefore, each computation stage is represented 
by one matrix factor, with one transfer matrix prior to each 
post-transfer butterfly computation matrix factor. Each 
matrix factor successively pre-multiplies the factor(s) 


representing the previous stage(s) when the matrix product 


Memcarried our. dhus, the general Form of theeMPFRT is: 
He E i 
E af E = af 7 1 logM 1 
x(k) =W logN,, logP,, C(logN)-1 W Glos r)—-1 2. W W re x(n) 
pest-teranster butterflies and pre—-Erans rer 
transfers butterflies 
(Acai) 


where all logarithms in the matrix indices are base eEwo. 
The remainder of this section discusses how to form each 
Gemeniewmacrix factors in (2.3/7). 
Po ceri with Bit-Remermsed input 

DhemwiPREweematrix, £;, issternme@ by reordering the 
row and column indices as described previously, and then 
forming the modulo N products to give the individual 
Sletegis. lhts matrix must then be factored to obtain 
the representation of each computation and transfer stage. 
The most illustrative case to consider is when L*=logoM=l. 
In this case, there is only one pre-transfer butterfly 
stage, and the maximum number of transfers and post-transfer 
butterfly stages. Each processor contains M=2 data points, 
and is thus capable of performing a 2-point DFT internally. 
At successive stages, each processor can compute one-half 


aaa pommen ET, one-fourth of an S-point DFT, etc., 


a9 


provided that appropriate data points are transferred to 
reside within each individual processor. Other situations 
involving more than 2 points per processor are easily 
described in relation to the L*=1 case. 

Bach NxN matrix factor SammC@2 2575s LiPSe pare leteme 
into blocks of dimension MxM. Each MxM block along the main 
diagonal represents an individual processor. The computation 
stage matrices EB contain entries only in the submatrix 
blocks along the diagonal. The entries in these blocks 
describe the butterfly computations taking place within the 
individual processors. The transfer matrices ie contain 
entries outside the blocks along the diagonal to describe 
the inter~processor data transfers. 

Consider once again the 8-point/4~-processor DIT 
MPFFT using BR input of Figure 2.4. Note that there are a 
total or 3(1log,8) COMputatilon Matrix taeeors. Ez, represents 
the first (pre-transfer) butterfly stage, and there are 
2 (log,4) Matwucenac tome representing the post-transfer 


butterfly stages, each preceded by a transfer matrix. 


Recall that the DIT algorithm decomposes an 8~-point DFT into 


two 4-point DFTs, then into four 2-point Dits. see figure. 


This decomposition partitions the input time sequence into 
subsequences having even and odd indices. At each 
successive decomposition, the butterfly inputs are separated 
by larger increments in time. The first butterfly stage 


combines samples from the time sequence separated by 4 
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units (N/2). Thus, the first tater aeeem E\> has column 
(time) indices 0 and 4. The ordering of the row indices, k, 
at the first stage is the normal ordering characteristic 

of the single-processor DIT FFT with bit-reversed input data. 
The row ordering is then permuted by each inter-processor 
transfer stage until the output sequence ends up in the PN 
ordering characteristic of the MPFFT on the Butterfly 
Network. The second butterfly stage in the DIT FFT combines 
points separated by two memory locations. In this example, 
this separation is not available within a given processor, 
so an inter-processor transfer must take place. This 
reorders the rows of the next computaton stage, and all 
successive stages. The algorithm for reordering the row 
indices was presented in Section II.D.4. The time 
separation between points combining in the second butterfly 
stage is 2 units, so the column labels for this matrix 

Bac bods (E.) are 0 and 2. Note that each twiddle factor 
exponent in the blocks along the diagonal is still obtained 
by the modulo N product of the row and column indices. The 


third (final) computation stage marix factor, E is derived 


=~ 3’ 
similarly. The row ordering is further permuted by the 
second inter-processor transfer. The row indices of the 
final computation stage are now in the PN order in which 
the transformed output sequence appears. The column labels 


are 0 and 1, cotrespondine to Ene bere mts) sean pine. 


separation of 1 unit. It can be verified by comparing 
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More 2.4 wien Pigure 2,2 that identical butterfly computa- 
Eons take place at each stage. That is, each row in the 
SCmpitedetonematnhix Lactors contains identical entries at 
any stage, only the ordering of the rows is changed in the 
MPFFT by the inter-processor transfers. 

pune aeaniattone ot thescolumn labels of the transfer 
matrices is also necessary. A column label of O indicates 
that the element in that column remains on the main diagonal 
(doesn't represent a transfer at that stage). The negatively- 
valued labels indicate elements moving down off the main 
diagonal, and correspond to lower-half processors at the 
transfer stage in question. Similarly, positively-valued 
labels indicate elements moving up from the main diagonal 
(upper-half processors). The magnitude of the transfer 
tier rees column Labels in the DET algorithm corresponds to 
the time separation between points that are to be combined 
in the butterflies at the next computation stage. The row 
indices are permuted by previous transfers the same as the 
piommud1ces of the computation matrix factors. 

iinemmaterx factors in (2.3/7) are called sparse 
matrices because of the numerous zero entries in any row 
aye rorreyd Paki gi The Sparse matrices cut down the number of 
arithmetic operations required to compute the DFT [Ref. 2: 
p. 67]. Multiplying out the matrix factors gives the 
permuted DFT matrix E as the result. Taking the products 


can be regarded as a mixing operation in which the column 
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time indices of the matrix factors add. Successive 
decomposition of the input time sequence results in Ey 
having only time values of O and 4 units, E, having only 
time values of O and 2, and E. having only time values of 
QO and 1. When the matrix factors are multiplied, the time 
values Mix som tm aegis! 


=E has time values of 0,4,2, and 6 


Bt ane 


unwmes: SimPlar ac aking Cee ae  E the mixing 


1? 
results in the product having time values of 0,4,2,60-1.5, 208 
This represents the ordering of the input sequence which was 
successively decomposed at each stage in the factorization 
of “Eto “give = thewpll algo t inte iiromeeca © tye 2 women meet 

DIT MPFFT (multiplying the matrix factors) can be viewed 

as a recombination of the input sequence that was decomposed 
in setting up the algorithm. The row indices of the E 
matrix are determined by the row ordering at the final 
computation stage, and represent the ordering of the 
transformed output sequence, X(k). 

The 8-point/4-processor example (L*=1) can be 
extended in several waysas follows. ocince M=N/Pe there are 
really only two variables in (2.37), the record length, N, 
and the number of active processors, P. The total number 


of butterfly stages must equal the number of pre-transfer 


and post-transfer butterfly stages. Therefore: 


logN = log,M+log.P = L* + logoP { 2eso)) 
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Two situations can result in fewer transfer stages than the 
L*=l1 example just cited. Extending the input record 

length to a higher power of 2 while keeping the number of 
processors constant resultS in one or more additional pre- 
transfer butterfly stages. This situation will also increase 
the total number of butterfly stages, logoN. Also, decreasing 
the number of active processors (again, required to be a 
power of 2) for a constant input sequence length results 

in more pre-transfer butterfly stages. This situation can 
result when processors suffer faults, and there are no spares, 
and leaves the total number of butterfly stages unchanged. 
Both of these situations increase the number of pre-transfer 
Pu@Lterisily staces bye increasing the size of the data block 
within each processor, M. This makes each processor capable 
of performing more than one pre-transfer butterfly stage. 

To illustrate, consider again the 8-point DIT MPFFT, this 
time using only 2 processors, as shown in Figure 2.12. 

tmewe is still a total of 3 (log,8) butterfly stages. Now, 
however, there are 2 (log,(8/2)) pre-transfer butterfly 
stages, and only 1 transfer stage and 1 post-transfer 
butterfly stage. Referring to the algorithm of Section 
II.D.4 to calculate the PN ordering for L*=2, the E matrix 
can be written down. Each of the matrix factors EL is 
divided into submatrix blocks of dimension MxM. Now, 
however, the bicoks will be 4x4. Each processor contains 


enough data to compute 2 pre-transfer butterfly stages. 
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8-Point/2-Processor DIT FFT (BR Input) 


Figure 2.12 
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Since during the pre-transfer stage(s) each processor is 
functioning independently (i.e., as a single-processor), 
the pre-transfer matrices are identical to the SPFFT case. 
Note also that all processors contain the same weighting 
factor pattern prior to any transers, regardless at which 
sereemthe first transfer occurs. 

When the submatrix blocks representing the 
individual processors are of dimension 4x4 or greater, 
there are dots, representing no entry, within each block. 
Thies 9s aWwawagous to the single-processor case, in that 
some or all of the required memory separation between 
points is available within the processors’ local memories. 
When the stage is reached where inter-processor transfers 
are necessary, rows of the succeeding stage matrices are 
permuted, as always. The column labels representing 
time values of each matrix factor are such that each 
processor block contains an equal number of zero and non-zero 
time labels. Regarding the transfer matrices, one or several 
of the transfer stages required in the L*=l1 case are 
"bypassed" as the N/P=M ratio increases. The algorithm 
for generating the transfer matrices presented in Section 
II.D.2 still applies, but the number represented by L*- 


(first bit to be switched) increases. 


on 


2; DIT MPFFI with PNeinoue 

The DIT FFT algorithm can be used on a single 
processor with normal-ordered input, to produce bit- 
reversed output. Similarly, the DIT multi-processor 
algorithm can also be used with a psSseudo-normal input 
sequence. As would be expected, the transformed output is 
produced in bit-reversed order. The MPFFT Matrix E is the 
transpose of the & matrix of the BR input case. This 
reflects the change in th input-output relationship of 
the PN- and BR-ordered sequences. The elements of the E 
matrix are thus stiiP obtained fromecnesmodulo Ni produen 
of the row and column indices representing the output and 
input Sequence ordering. Unfortunately, the matrix factors 
describing individual stages are not simply the transposes of 
the matrix factors from the BR input algorithm. This is 
due to the different ordering of the input sequence. The 
inter-processor transfers occur in reverse order, and for 
the case where L*#l1 (fewer than maximum number of transfemoee 
even occur at different stages. Specifically, for an 
8-point/2-processor DIT algorithn, *=log,M=2. When using 
PN input data, there is one transfer stage (log.P) as 
expected. However, the one transfer is required after the 
first butterfly stage, and not after two pre-transfer 
Stages, as in the BR input case. There are two post-transfer 
butterfly stages, but only one transfer. That is, there is 
no second transfer required between the two post-transfer 


butterfly stages. (Compare FPacume 2Zele wrens bi gunmen eee 
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8-Point/2-Processor DIT FFT (PN Input) 


Faepitite, 2. 13 
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Again, the easiest case to consider is when L*=1. 
There is only one pre-transfer butterfly stage, and each 
post-transfer stage is preceded by an inter-processor 
transfer stage. In this case, though, since the Butterfly 
Network is transforming PN input into BR-ordered output, 
the transfer pattern is the reverse of that in the BR 
input case. The individual transfer matrices are the same, 
since they are symmetric, but they occur in reverse order. 
(Compare T, of Figure 2.14 and T. of Figure 2.4.) The column 
labels once again correspond to the time decomposition of 
the input sequence. For an 8-point DIT, the labels are 
O and 4 at the first stage, O and 2 at the second stage 
and OQ and 1 at the final stage. The column indices of the 
transfer matrices are also the same as before. The non-zero 
magnitudes still indicate time separation at the next 
butterfly stage. Values of zero indicate non-transferring 
elements, and the sign gives direction of movement off the 
main diagonal. The row indices for the first computation 
matrix factor are in what will be called "pseudo-bit- 
reversed" (PBR) order. This is the output ordering that 
would result in a SPFFT using pseudo-normal-ordered input. 
Recall that the exact PN ordering is a function of the 
number of required network transfers, logoP. The same 
is true of PBR ordering, since the transfers ultimately 
permute the PBR ordering into bit-reversed-ordered 


output. Thus, the elements within each matrix factor can 


70 





©) 8: 
d s 
JES oe 


QQ 


ns | ~t'| | 
N | oO ‘4 = wo 
=—— i ee 
oOo | aa) 
ag tu pr Oy 
. ws eS 
a. i toe: 
0, O° oO} lo ol! yy 
a as 


a | © 
(@) | Oo © 
a i 
On | ro ol 

| | 
re) 
saat st ee. 
qq 2 i | 

| 

ro) oO O| 

| 
—--~+4---,--1--- 


eo) | Ds Sites 
a ' | | 
> Ja Toe ae 
ae <- ae 
| 
oO | lo a) e 
ee ae | 
r4 oO} | 
= | See a 
7 “| eS ‘7 e 
ol 5... | 
Oo 
, a 
ON HY Or YOM wW hb 





an a, c-..., - 1 
” 13 | e's, eo 
aoa b= --- 
on 2g gh) Poss: lo ies 
a oO l + le ae) 
ica) ———+-——l_—_;--- 
, © | 2 at SS 
eee) Ge nO | ao lO 
ye ea 
m Oo ;° ‘ ly 
| 
4% og A oOo mo we bk 





No wo ath ow al 


moO toa oO hid w 


| 
wo oly 4,0 Cia 
NED On sci ao © 
Oa e oo Ro 
How Nn ole oon 
= J ee 
+t Oo oloolw~wy¢ 


2.0 olo ojo Clo o 


| ! | 
%o tnornrwn On 


|o oO 
= 22 SS 
rq | is un | 
By : |} ° oO} 

a i | : 
—--+--4---;--- 
Ho w | i 
O70 ol | | 


“own orn OD bk 


(2.41) 


8-Point/4-Processor DIT FFT (PN Input) 


Figure 2.14 


tall 


still be calculated from the modulo WN predict ofnetewea. 
and column indices. The row ordering is permuted by each 
successive transer, as before. At the final computation 
stage, the row indices, k, are in bit-reversed order 
corresponding to the ordering of the transformed output 
sequence. The column labels of the E Matrix Calealain 
be found by summing the time indices of each of the 
individual matuwwxe cactors. 

The situation becomes slightly more complicated 
when L*41. For PN input, there is only one pre-transfer 
butterfly stage before inter-processor transfers become 
necessary. There will be the same number of transfer 
stages as before (BR input case), but they occur in 
reverse order. This leaves log, M=L* computation factors 
at the end without any transfer matrices between them. 
This is analagous to the BR input case, when there were 
L* computation stages prior to any transfers. 

Recall the discussion on PN ordering from Section 
II.D.4. Each processor contains two subsequences of M/2 
points in normal order, each subsequence separated by 
N/2. Since the DIT algorithm combines points separated 
by N/2 at the first butterfly stage, the PN input case 
calculates M/2 2-point DFTs in each processor at the@tiirse 
computation stage. There are (/2}1 dots (no entry) in each 
block submatrix corresponding to the local memory 


separation between points combining in the butterflies. 
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Refer again to Figure 2.13. The butterfly width remains 
constant for successive stages preceded by inter-processor 
transfers, and decreases when computation stages are no 
longer preceded by transfer. Again, recall the BR input 
case where butterfly widths grew to a maximum, determined 
by local memory contents, then remained constant as 
transfer stages occurred. Since the PN input sequence 
does not have points separated by N/4 (second stage DIT 
butterfly width) residing in any processor, the first 
transfer stage occurs after the first butterfly stage. 
Transfers occur until log.P Stages have been carried out. 
The final L* butterfly stages do not require transfers 
between them, as mentioned above. Each transfer 
successively permutes the pseudo-bit-reversed row labels, 
until the bit-reversed output sequence emerges. 
3. Pseudo-Bit-Reversed Ordering 

Further discussion of the nature of pseudo-bit- 
reversed (PBR) ordering is necessary. Just as normal 
ordering is permuted to PN ordering by the DIT MPFFT using 
BR input, PBR ordering serves as the starting point for 
PN input algorithms. Also, PBR order is the ordering of 
pwe owt put that™would result from a SPFFT DIT algorithm 
using PN input ordering. The terms "PN" and "PBR order" 
are really meaningless when talking about single-processor 
implementations, since each is a function of the number of 


inter-processor transfers. However, if a particular 


(es) 


configuration is assumed, say 8 points/4 processors, then 
that. particular PN ordénine, whensinpueetowa, Sei wo ulea 
result inthe corresponding (8/4) PBR-~ordered output. In 
any case, the algorithm of Section II.D.4, which generates 
PN from normal ordering, can also be used to generate PBR 
ordering from bit-reversed order. To use the algorithm 

as presented, the bits must first be reordered to bit- 
reversed ordering. That is, instead of descending order, 
the bits are reordered in ascending order. For example, 

a l6O-point data sequence requires 4 bits to describe the 
set in binary. These bits when bit-reversed are ordered 
Do» Di» Dos Da. Now the algorithm may be applied 

directly to generate psSseudo-~bit-reversed order, for a 
given number of processors. Recall that the algorithm 
Switches bits L*-1 and L* at the first transfer stage, 
then increments both bits at successive stages. Thus, at 
the second stage, bits L* and L*+l are switched, and so on. 
The transfer matrices used for the PN input case occur 

in reverse order from the BR input case. Thus, this 
algorithm can be used in reverse, to give the Stage-by-stage 
row indices, k, and ultimately, the BR-ordering of the 
output sequence. Refer to Table 1 for a display of these 


relationships. 
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TABLE 1: 


16 Points/&8 Processors (DY =log,,M=1) 


babe bs b 


Normal: 


S22 4° 0 
T, ’ by by 
T,: by bo 
Tz: be ae 
PN: 
bobsboby 
Note: 
16 Points/4 Processors (L =2) 
Normal: bbb, 55 
aes al D5 
To: Do bg 
PN: by b3952% 
16 Points/2 Processors (I =3) 
Normal : 55,9) 25 
T, : Do b. 
PN: bobab bo 
8 Points/4 Processors (EI =1) 
Normal: 555, 5, 
To: by bo 
Note: 
8 Points/2 Processors (Ir =2) 
Normal: bob, by 
Ty: by bo 
PN: le) lore te 
2 O 
Note: 
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SEQUENCE REORDERING (DIT MPFFT) 


Om 2-3 
T.’ 29 D4 
1s do 
Ty: 2p 2 

: lo) lope low le: 
on m2 3°0 

Tg=T,, Tp=tp, Ty=T, 
BR: Bpb, bob. 

tee EE Do 
ieee ) Da 
PBR: BaD eee 
BR:  bpb, 5b, 
ie be bs 
PBR: bob, bab. 
BR «bb, 
fig y lero by 
Ty: by bo 
RERo by bob, 
Ty=T,, Tat, 
BR: byb,b, 
Ty: by Do 
PBR: b_b_b 
oD 1 
T,=T, 


4. DIF MPEFEL Aleorithms 

As in single-processor algorithms, the multi- 
processor DIF FFT exhibits) mugen dial ty ene trelationmeean enc 
multi=processorypil FFY. The ViPRRienet aie E can again 
be viewed as a DFT matrix with rows and columns reordered 
corresponding to the output and input sequences. The &E 
matrix of the DIF with PN input/BR outpitwi1s the tramcpose 
of the E matrix of the DIT with sn i npme emote pula 
transpose of a product of Matrices is Ehe productmonerenec 
transposed matrices, in reverse order. The stage matrix 
factors of the DIF with PN input are thus the transpose 
of the factors, in reverse order, of the DIT with BR 
input. The DIF algorithm partitions the output sequence 
into even and odd values of k, the frequency index. The 
row indices now represent frequency separation at the 
individual computation stages. That is, the first matrix 
fac tou Ey: has frequencies 0 and 4. The units of 
frequency depend on the analysis period, or record length. 
The column indices start out in the order of the input 
sequence for the first stage factor. They are then 
reordered by each transfer stage, in the same manner that 
the row indices are permuted in the DIT algorithm. Thus, 
the column indices, n, represent the points of the input 
sequence that reside ina given processor at any given stage. 
The transfer matrices are the same as for the DIT, and occur 


in the same order as the DIT with the same input ordering. 
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That is, the DIF with PN input has the same transfer 

pattern as the DIT with PN input. The pattern is the 

reverse of the DIT with BR input. Since the transfer 

matrices are symmetric, the transpose operation leaves 

them unchanged. Thus, the individual factors of the DIF 

with PN input/BR output are each the transpose, in reverse 

SvacryeotwememtLactors of the DIT with BR input/PN output. 

Since the ordering is reversed, the magnitudes of the 

transfer matrices' row labels are given by the frequency 

separation at the previous stage. Recall that the 

magnitude of the column labels on the transfer matrices for 

the DIT are given by the time separation for the next 

aaee wo oee ucure 2.15 and compare with Figure 2.4.) Similar 

duality exists between the DIF with BR input/PN output and 

the DIT using PN input/BR output. The He matrices are the 

transpose of one another, and the stage factors of the DIF 

are the transpose of the factors of the DIT, taken in 

PaweTeemondger. (See Figure 2.16 and compare with Figure 2.14.) 
This section has presented the factorization of 

FFT matrices, to obtain their multi-processor representations. 

Following sections will investigate the implementation of 


other fast transforms on the Butterfly Network. 
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8-Point/4-processor DIF FFT ( PNsnemaites 


Figure 2.15 
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8-Point/4-Processor DIF FFT (BR Input) 
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Figure 2.16 


III. OTHER FAST TRANSFORMS ON THE BUTTERFLY NETWORK 


This section uses the tools developed previously to 
investigate other fast transform algorithms for potential 
implementation on the Butterfly Network (BN). Section II 
used matrix representations to show that the radix-2 FFT 
algorithm is supported by the BN. That is, the radix-2 
FFT can be partitioned so that multiple processors can 
carry out the computation in parallel. Thet BN supports 
the required inter-processor transfers of partially- 
processed data, producing transformed output in anew, yet 
easily decipherable, order. This section examines several 
other algorithms to determine their adaptability to 
parallel processing on the BN. The analysis will focus 
on the required inter-processor transfer patterns, and 


the output ordering produced from the Butterfly Network. 


A. FAST HARTLEY TRANSFORM (FHT) 

Bracewell has articulated a real, fast transform 
algorithm that serves all the purposes of the complex 
Fourier transform [Ref. 11]. “Deets seae edu tne tase 
Hartley Transform (FHT) in honor of Hartley, whose 1942 
formulation of a real integral transformis the basis for 
the FHT. The FHT algorithm is faster than the FFT, and 
if desired, may even be used, with a little additional COSit. 


to compute the DFT. 
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As shown previously with the DFT, the discrete Hartley 
transform may be viewed as a Square matrix operating on an 
N-dimensional vector. The DHT matrix can then be factored 
to obtain the FHT algorithm where each factor represents 
one stage of the algorithm. The step from the Hartley 
transform to the Fourier transform can also be expressed 
aemaematcix multiplication and so the factorization of 
the Hartley matrix leads directly to a new factorization 
Beewthe Fourier matrix [Ref. 11: p. 64]. 

1. Single-Processor FHT (SPHFT) Algorithm 

The Discrete Harley Transform (DHT) is defined 


bRert. fl? p. 27] as the sum of the cosine and sine transforms: 


N-1l 
X(k) = < 2 an) Casiecamea Ny), k=O, les. : ,N-1 
n=0 
(3 > 
where cas@ = cos@ + sin? . The inverse transform is: 
N-l 
x(n) = = X(k) cas (27nk/N), n—OPl. «,.,N=1 (2379) 
k=0 


Note that only real arithmetic is used, and except for the 
PaGEOL y in (3.1), the forward and inverse transforms 
aeeewtdentitcal, ihus, for real input data, the DHT is also 
real valued, and the original sequence can be obtained by 
using the same transformation formula a second time. 
Equation (3.1) may be represented as a vector-matrix 


equation: 
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X(k) = H x(n) Cn) 


The DHT matrix H can be factored to give the FHT, similar 
to the way FFTs are obtained from the factorization of a 
DFT matrix. The input sequence is again assumed to have 
length equal to a power of 2. Thus, the DHT matrix H is 


factored as: 


= aie re ey ee (S58) 


where G = log.N. The factors Le: s=1,2,...,log.N are 
called stage matrices, and represent the computations at 
each Stage ot thewal sori thine ete ump se qtle i ccunaane mane 
in bit-reversed order, and yields normal-ordered output. 
Each of the factors in (3.4) are sparse matrices, 
contributing to the speed of the FHT algorithm. The FHT 
employs a "divide and conquer" strategy similar to the 
FFT. An N-point DHT is obtained as a linear combination 
of the outputs of two N/2-point DHTs, each in turn 
obtained as combinations of the outputs of two N/4-point 
DHTs, ete. The deconposi: 7onmaroeecd sume 7p oa ne 
sequences result. The DHT of a 2-point sequence is 
obtained simply as the sum and difference of the 2 data 
elements. To illustrate, applyancmtem@e run tt ono ele 


for N=2 yields: 
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X(0O) = [x(0)(cos O+tsin 0)+x(1)(cos O+sin 0)] 

X(1) = $[x(0)(cos O+tsin 0)+x(1)(cos7 +sin7 )] (3.5) 
RCO) = 3{x(0)+x(1) ] 

meet =a x(0)-x(1) | C3 .0)) 


iar matrix form: 


xX (0) 1 1 x(0) 
. - (3.7) 
X(1)I 1 —] ixC1) 


Reference 1!1 defines a general decomposition formula that 


to} 


produces the DHT by successive halving of the input 
Beqwenee., Given the input sequence x(n), n=0,1,...,N-l1, 

let the N-element subsequences {x (0) Om 2920 See ead 

eae) 0 x(3) 0 mee) fave DHS X, (k) and X4(k) respectively. 


Then, 


ee) = XK, (k) + Xj (k)cos(2rk/N)+X,(N-k)sin(27k/N) 
CS ee) 
Mipeemorrierure Corresponds to the factorization of the 
DHT matrix, H prec. | lsep. 12) |.. 
The second stages compute one or more 4-point 
DHTs, using two 2-point DHTs as inputs. The cosine and 


Sine multipliers will be either 1, 0, or -l, since 
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(2mnk/N) = (27nk/4) = (nk7/2), and only integer multiples 
of w/2 are thus possible. Applying (3.1) for N=4 and 


puttihe the results@in matrix egoummeoi1y ese 


OD 1 1 1 1 x(0)| 
Feil 1 ey = | Se Gp) 
= (227900) 
X(2) lee il eee x(2) 
| 
x3) J} -1 0 =1 1 GQ) 


The columns are rearranged to correspond to bit~reversed 


Tipe, yy 1 ec ledemaies 





X¥(0) 1 l x( 0) 
Kei) ~1 1 x( 2) 
= (22.503) 
KC ih yal 
X(3) <j x (3) 


Equation (3.10) can be factored] Sinecer halt tEheweceums can 
be computed as 2-pornte won leas mentioned above. This 


Fac Cordagat ion Vac die 





i 1 1 1 x (0) 
1 1 1 ~1 S29) 
1 -] 1 i x(1) 
1 ~] 1 ~]1 x(3) 
= L Ly x(n) 63.5 118) 
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iter smeastly verified by carrying out the matrix 
Divalent ion that (€3.11)"is equivalent to (3.10). 
(In this development, and in the future, the nv factor 
is neglected for clarity. It can be inserted at the 
end, if necessary for normalization.) 

The third and subsequent stage factors, representing 
8- (or more) element FHT stage matrices are more complicated. 
the mepwits to the 8-point FHT are the outputs of two 
4-point FHTs. This recurSive property for generating 
higher-order FHTs from the combination of lower-order 
FHTs never changes. What does change is the number of 
non-zero elements per row in the stage matrices. For the 
third and subsequent stages, there are three non-zero 
elements in each row. This means that three elements 
enter, with appropriate coefficients, into each output 


element. The third stage factor is: 


1 Co 
l oH © 
l iS 
L, = a 53 C (GR2) 
1 Gy 
1 Ge S 
1 Ke 
1 3. C 
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where CT = €os( 2m. = = sin(27n/8) and KT = 
[cos(27n/8)+sin(27n/8)]. Note that the cosine elements 
and the 1's are on diagonals parallel to the main diagonal, 
but the sane elements ynun the othe rw) eo a) Ge staie 

the independent variable of the operand element that 
multiplies the sine coefficient Das decreases as n 
increases. This property is referred to as retrograde 
indexing [Ref. 11: p. 73]. The arguments of the sine and 
cosine terms are integer multiples of 7/4, taking on a 
numerical value of either, 0, +1, or + 1/¥V 2. Pte teens 
numerical values in the b. matrix and carrying out the 
matrix multiplication gives the result shown in Figure 3.1. 
Again it is easy to verify that the product in Figure 3.1 
gives the same result as the DHT definition of (3.1). 
Conversion of the DHT to the DFT is accomplished simply 
by associating the even and odd parts of theDHT with the 


real and imaginary parts of the DFT. For a discussion of 


the mechanics of this operation, see reference ll, pp. 75-/7 


and reference 12, p. 150. However, if the power spectrum 
is desired, it can be computed directly from the DHT by 
evaluating [X(k) ]7+[X(N-1) ]*. Finally, note that since 
the FHT is a real transform compared to FFT, which 

is complex, the real and imaginary parts of any 

complex input data can be processed in parallel, 


respectively, with the FHT [Ref. 12: p. 147]. 
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8-Point/1l-Processor DIT FHT (BR Input) 


Figume 3.1 
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2. Multi-Processor FHT (MPFHT) Algorithm 

The Fast Hartley Transform can be partitioned among 
several processors for parallel implementation on the 
Butterfly Network. The number of points per processor, 
M=N/P, must be greater than or equal to 4, since the 
third and successive computation stages (N>8) combine 
3 elements at a time. Thus, there are at least two 
(log.M) pre-transfer stages for any N-point FHT. The 
recursive nature of the FHT yields inter-processor transfer 
requirements similar to the MPFFT. That is, after 
L“=logoM pre-transfer stages, there are transfer stages 
interspersed between the subsequent post-transfer computation 
stages. As always, the transfer stages permute the row 
ordering, and thus the ordering of the output sequence. 
Processors again retain half their data, and transfer half 
at any given stage. When more than one transfer stage is 
required (4 or more active processors), the patterns of 
the output ordering are apparently not so convenient as in 
the MPFFT. More desirable patterns than those to be 
“presented may well exist. The difficulty lies with the 
fact that the amount of data transferred is always a power 
of 2 (one-half each processor's memory), while 3 points 
are combined in the third and successive computation stages. 

The easiest case to consider is an 8-point FHT 


implemented using 2 processors. There are 3 (log.8) 
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computation stages: 2 pre-transfer computation stages 
(L* = log, (N/P)=2), Petranstenreceagc (log,P=1) and 


1 post-transfer computation stage. For this case: 
K(k) = L, Ty bo by x(n) (3.14) 


The product is computed in Figure 3.2 and can be compared 
with Figure 3.1. The transfer matrix is symmetric, and 
was chosen in keeping with the invertability property of 
the DHT. Note that this transfer pattern causes bit- 
reversed output, when the input is also in bit-reversed 
order. Again, exactly one-half of each processor's memory 
is exchanged. This time, however processors exchange 
alternating memory locations, instead of the upper- or 
lower~half as in the MPFFT. Processor P retains its 
even-~labelled memory locations, and transfers its odd- 
labelled memory locations in exchange for processor P,'s 
even-numbered data. The order of computation after 
transfer is somewhat subjective, and was chosen to minimize 
the amount of buffering required within each processor. 
In this case, the requirement of only 2 additional memory 
Hoc@enons per processor for buffering a@ppears to be 
optimum. 

Extending the discussion to N=16 illustrates the 
features of the generalized multi-processor FHT. For a 
16-point data sequence transformed using the same 2 


Brocessors, the vector-matrix equation extends to: 
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8-Point/2-Processor DIT FHT (BR Input) 


Figure 3.2 
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eee ees ep by 260) oe 


The recursive property of the FHT keeps the first 3 stage 
iaerices Ly; Los and Le CoG eo Meenee same “with their 
patterns repeating along the main diagonal. Thus, only 
the transter stage matrix 1 and the final 16-point 
computation stage Ly need be discussed. The transfer 
pattern is chosen again so that a 1s symmetric. The 
symmetry property, along with the property of only one 
non-zero element (whose value is unity) per row and 
column, makes the transfer matrices presented herein 
orthogonal. Processor Py once again transfers its odd- 
Mabelled memory locations (this time, 4 elements) to Pas 
in exchange for es even-numbered element. Themor den 
of computation obviously determines output ordering. 
This will be discussed further in the next section. 
Matrices Qi and L, are shown, as is their product, in 
Meewrero.5. tthe fourth stage compuation matrix Ly Oi 
eae wsamele—processor product L . QT 1s equivalent to the 
single-processor Ly Matrix with reordered rows. Since, 
for any given row, the column entries are thesame 
both implementations perform the same computation, 
although the ordering of their outputs is different. 
Consider now the 16-point FHT implemented on a 


4=processor Butterfly Network. This illustrates the 


"naximum transfers" case for the FHT since more than 3 
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points per processor are required. This is not really a 
problem, since the multi-processor's advantage lies in 
the computations that can be done in parallel, whether 
prior to or after the required inter-processor transfers. 
Reference 1l also contains a program called FASTPERMUTE 
that speeds up the required reordering of the input 
sequence, from normal to bit-reversed order. This thesis 
assumes that a bit-reversed sequence is made available 
as the input to the multi-processor algorithms presented. 
The vector-matrix equation for the 16-point/ 


e=peocessor MPFHT is: 


_ f 
K(k) = L, Ty by £47, bo by xa) (3.17) 


where part of the fourth computation stage is performed 
prior to the second transfer, for reasons to be explained 
below. The pre-transfer stage matrices, L, and Lo are 
identical to the single-processor versions, and duplicate 
the patterns of the 2~- and 4-point FHTs as submatrices 
along the main diagonal. The first transfer stage qT, 
occurs at the same place as, and duplicates the pattern 

of, the 8-point/2-processor MPFHT. This should be 

expected, since both have the same value of L¥=log,M=log.(N/P), 
and it is this value that drives inter-processor transfer 
requirements. The first post-transfer computation stage 


is also identical to the 8-point/2-processor case, again 


with the element pattern repeated along the main diagonal. 
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These patterns were chosen in keeping with the recursive 
property of the single-processor FHT algorithm. That is, 
an 8-point FHT can be made up from the outputs of two 
4-point FHTs, each in turn derived from the outputs of 
two 2-point FHTs. This property is responsible for the 
lower-order computation matrices repeating as 
submatrices along the diagonal in the early stages of 
higher-order FHTs. 

The fact that the FHT combines 3 elements into 
one output element at the third and successive computation 
Stages causes problems when more than 2 processors are 
used. Recall that the number of active processors, P, 
sets the number of inter-processor transfer stages (log,P). 
Thus, for 4 or more active processors, there are 2 or more 
transfer stages. Since the transfers occur after all the 
pre-transfer computations (for N=16 or a higher power 
of 2) the transfers occur between stages involving 3 
points in the computations. Therefore, an inherent 
incompatibility exists between the transfers, involving 
a number of points equal to a power of 2 (one-half of 
each processor's local memory), and calculations involving 
3 elements. Figure 3.4 illustrates that all elements that 
are other than 1 (multiples of sin(7/8 and cos(7/8)) fall 
into the right-half of matrix L, (columns 8-15). The 


non-unity coefficients multiply elements x>(8) through 


OCS). which are outputs from see ae Ot Note in L. 
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Figure 3.4 Fourth Stage of 16-Point/1-Processor FHT 
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of Figure 3.5 that all these elemenrs occur ine 2 gees some 
Ps and P.. Thus, one way around the problem is to 
partially pre-compute the fourth stage prior to transfer. 
After the inter-processor transfer, stage Ly Finishes the 
16-point MPFHT computation, using only 2 elements in the 
cakculation of Geach outputepol nue 

Note that the partial prescalculation (smece Lis 
in Figure 3.5 and Equation 3.17) leaves processors Poy and 
Py idle at that stage. This inefficiency pays dividends 
in thes final stase 4 computainen Cle since this stage 
now only involves sums and differences of 2 elements. No 
synchronization problems should be generated, since the 
next stage is a transfer stage, involving handshaking. 
Partial precomputaton is only necessary when 2 or more 
transfers are required, and arises in part due to the 
decisions made involving patterns of the first transfer 
and order of computation of ENGRRinepostE-Eranecter 
computation stage. The products of the third and fourth 
stages of the 16-point/4-processor MPFHT are shown in 
Figure 3.6. It can be verified by comparing Figure 3.6a 
with Figure 3.4 that the multi-processor implementation 
performs identical computations as the single-processor 
algorithm. The output is reordered an the multi-processor, 


version, as indicated by the row reordering caused by 


transfer r5- The columns of the product Ly “ : 
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are reordered as well. This reflects the fact that the 
input to the partial pre-computation stage (L,) comes from 
the output of stage L.. This sequence was permuted by 
the first transfer in the 4-processor implementation, Ti: 
This is verified by comparing Figure 3.6b with Figure 3.1, 
which shows identical elements in any given row, with the 
rows of Figure 3.6b reordered due to transfer Ti: 
Figure 3.6b can also be compared with Figure 3.2. Note 
that each factor in Figure 3.6b is a duplicate of the 
factors in Figure 3.2, as is the final product. Thus, 
the MPFHT retains the recursive property of the FHT, in 
that the initial stages of higher-order FHTs are duplicates 
of the outputs of lower-order FHTs, as desired. 
3. Transfer Patterns and Output Ordering 

The single-processor FHT (SPFHT) accepts bit- 
reversed input, and generates a normal-ordered output 
sequence. The multi-processor FHT (MPFHT) implemented 
on the Butterfly Network also accepts input in bit-reversed 
order. The transformed output Sequence, however, is 
permuted by the inter-processor transfers required. The 
permuted output is not the pseudo-normal (PN) ordering 
characteristic of multi-processor FFTs, for reasons that 
will be explained shortly. Normal-ordered output may 


again be obtained by additional network transfers, or by 


cyclically reading the processors' local memories. 
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Recall the earlier statement that the "maximum 
transfers" case of the MPFFT won't work with the MPFHT. 
That is, 4 or more points must be contained in each 
processor. The situation involving the maximum number 
of transfer stages in the MPFFT occurs when only 2 points 
are contained in each processor. Thus, the first transfer 
of the "maximum transfers" MPFFT case (L¥*=log.M=1) is 
effectively bypassed in the MPFHT "maximum transfers" 
case (L*=2). With this in mind, the algorithm of 
Section II1.D.2 for generating transfer matrices can be 
modified to generate the MPFHT transfer patterns. The 
modification to the “bit=switcheme: maloorithmon 
Section II.D.2 is necessary because of which memory 
locations are involved in the MPFHT transfers. Recall 
that the MPFFT transfers involved the upper- or lower—hanle 
of a processor's local memory. The MPFHT algorithm 
transfers the even- or odd-numbered local memory 
locatons, instead of adjacent upper- or lower-half 
memory locations. This all relates to the properties 
of counting in binary. That is, every even decimal 
number has a binary representation ending in 0, while 
every odd decimal number's last bit is 1 when expressed 
in binary. The modified algorithm for generating MPFHT 


transfer matrices is described below. 


mez 


As in the MPFFT case, the MPFHT transfer matrices 
are viewed as permutations of NxN identity matrices. 
This procedure determines the column indices where 1's 
appear, by switchingappropriate bits in the binary 
representation of the row indices. The bits being 
Swatched =ditter from those of the MPFFT transfer matrix 
acon ithmeume Acain, this is die to the fact that the Ralf 
SinMcaci processor S Memory tramsterring is now based on 
even- and odd-numbered local memory locations. Not 
surprisingly then, one of the bits to be switched is 
always the last, or Oth bit, since this bit determines 
even or odd. “(Bits are again labelled by bp izes bob ba, 
where L=(log.N)-1.) Tie ERESteenanster Stage Swaps bit 
by ween Dat b,*, where L¥*=log M=log.(N/P). Successive 


transfer stage switch bo with b then with b 


lees | Raye 

etc. For example, in the 16-point/4-processor 
implementation of Figure 3.5, log,N=4 and L*=2. Thus, the 
binary row indices have bits labelled b,b,b,bo- The 

fiewest Stage transfer matrix yi determines column 

position of elements by switching bo with bo in the row 
indices. The second transfer matrix T, switches bo and b,. 
Source and destination processors are identical to the 
MPFFT case, and can be calculated using the cube(i_1) 


functions to find a destination, given a source. 


Additionally, at the ith transfer stage, the (i-l)st 
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bit of the processor's binary address determines whether 
a processor transfers its even or odd memory locations. 
If the (i-l)st bit is zero, the processor transfers its 
odd-numbered memory locations, and if the bit is one, 
the processor transfers even-numbered memory locations. 
As in the MPFFT with "upper-half" and "lower-half" 
processors, those processor's transferring their odd- 
numbered memory locations receive the even-numbered 
locations of another processor. This convenient 
property again makes all transfer matrices symmetric 
and orthogonal. See Figure 3.7/7 for a graphical display 
of this example. 

Discussion of the reordering of the output 
sequence can be simplified by distinguishing between bit 
postions and bit labels. Let the normal-ordered output 
sequence have binary representation with bit labels 


b,b5b, bo; occupying bit locations (positions) b,b.5b D 


1°0O 


(N=16). Now the algorithm just. discussed generates 
column indices from row indices, when applied to bit 
locations. That is, each transfer stage starts anew 
assuming bit locations bbb, by for row indices, then 
switches the two appropriate bits for that transfer stage 
to give column positions where l's occur. Since the 


transfer stage permutes the output ordering cumulatively, 


the bit labels are not restarted at successive transfer 
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stages. For example, the 16-point/4-processor case begins 


with bit labels bab5b,) by in bit locations bzb5b,b,- The 


first transfer xy Switches DY and Do to give bit labels 
(output sequence) and bit locations (column indices) 


bD,b,b,b..- The second transfer I, again starts with row 


ordering in bit locations D,b,b, by, while the bit labels 


are left partially-reordered (bgb)b,b.). Then, switching 


925,53, while 


the output sequence is Dob3b,b.- When the bits of the 


Do and De gives column indices ordered b 


respective binary representations are permuted according 
to this rule, the correct column positions and output 
sequence order result. This procedure was used when 
labelling the figures in this section. The column 

indices correspond to input ordering at a given stage, and 
row indices give ae tui ordering from that stage. 
Therefore, the row labels of the final computation stage 
give output ordering in all cases. 

To summarize, the Fast Hartley Transform is 
realizable on the multi-processor Buttrerfly Network. 
Several pertinent features of the implementation have been 
presented, but details of the features and properties of 
this new algorithm have of necessity been only partially 
covered. The interested reader is Peperied to reference 1l 
for a more detailed discussion of this algorithm that has 


great potential for Signalepmeece sc ince ete bower 
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B. FAST WALSH-HADAMARD TRANSFORM (WHT) 

While the DFT uses sinusoids of harmonic frequencies 
aeeresSeDasis functions, other transforms can be defined 
using other bases. The Walsh functions, named for 
Joule Walsh who introduced them in 1923, are the basis 
functions for the Walsh-Hadamard transform (WHT) 
fetewzcepemool|. Walsh functions can be generated 
recurSively, are orthogonal, and form a closed set. 

Since they are square waves, Walsh functions take on 

only two values, namely +1 or -1 [Ref. 13: p. 225]. 

This property enables the WHT to be calculated using real 
number operations, involving only additions and 
subtractions. This section presents two of the several 
possible ordering schemes for calculating the WHT. 

After introduction of the single-processor implementation, 
the WHT will be examined for compatibility with the 
multi-processor Butterfly Network. Again, the emphasis 
will be on inter-processor transfer requirements and 
output ordering. 

1. Single-Processor Walsh Ordered Transform (WHT)w 

Walsh functions can be rearranged to form 
different ordering schemes such as Walsh or sSequency 
order, Hadamard or natural order(shown in Figure 3.8), 
Paley or dyadic order, and cal-sal order [Ref. 2: p. 303]. 


This section examines the transform derived from the 
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Figure 3.8a Walsh Functions in Walsh or Sequency Order 


[Ref. 2: p. 304] 
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Figure 3.8b Walsh Functions in Hadamard or Natural Order 
(Ref. 2: p.305] 
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Walsh functions are described in terms of sequency, 
which combines the number of sign changes and frequency. 
Sequency is defined as the number of sign changes per unit 
interval: 

Os) een 7 oC. 


sequency = (3. fer 
$(Z.C.+1) odds. Cz 


where Z.C. is the average number of zero crossings per 
Unit amtervalepeen., 2: peeotour 

Sampling of the Walsh functions at equally spaced 
sample points generates the Walsh-Hadamard matrices in 
corresponding ordering. Periodic sampling of the 
Walsh functions in sequency order at 8 equi-spaced 


sample points gives: 


2 3 4 5) 6 7 Sequency 





1 i l il ii 0 
1 1 -l -l -l il 
-l -l -l -l il 1 
-l -] 1 1 -l 2 
[Hw(3)] = 
-l 1 1 -l -] zy 
-l 1 —] 1 I 3 
1 -l -l 1 -l 5 
l -] - -] 1 4 
C3, 195 


where {[Hw(3)] is thus the 2722” Walsh-Hadamard marix in 
Walsh or sequency order. The rows represent Walsh 
functions whose sequencies are listed. Whereas te 
defines the DFT matrix for N=275, [Hw(G)] defines the 
(uni yw matrix |Ref. 2: p. 310]. The CWHT)w is thus 


defined as: 
X(k) = (1/N)[Hw(G) ]x(n) (3.20) 


while the input data sequence can be recovered from the 


inverse transform: 
xn) = [Hw (Gp |X(k) (3221) 


The same algorithm can be used, then, to generate the 
transform or its inverse, except for the (1/N) factor. 

One fast algorithm that results from the marix factorization 
Seow) 1s shewn in Fisure 3.9. This factorization 
reduces computation requirments from Nn? to NlogoN additions 
and subtractions, where N is again the length of the input 
sequence. The algorithm represented in Figure 3.9 accepts 
bit-reversed input, and generates a transformed output 
Sequence in normal order. Note also that this algorithm 
can be computed in-place, thus minimizing extra memory 
required for buffering. The factored-matrix equation 
corresponding to Figure 3.9 is shown in Figure 3.10. 
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Figure 3.10 


LaAles 


computation stage. The final precuec. W > is equivalent 
to [Hw(3)] in (3.19), with the columns reordered to 
correspond to the bit-reversed input. The rows of W 
are in the normal order in which the output appears. 

2. Multi-Processor Walsh Ordered Transform (MPWHT)w 

The Walsh or sequency ordered transform (WHT)w 

can be partitioned among several processors for 
implementation on the Butterfly Network. An alternate 
(equivalent) signal flowgraph to enable partitioning is 
shown in Figure 3.1l. It is easily verified that the 
flowgraph of Figure 3.11 represents identical computations 
to those of Figure 3.9. Note that multi-processor 
implementation of Figure 3.9 directly is not possible 
due to the ordering of points pare rcipating in Bacwei rset 
butterfly stage. The flowgraph of Figure 3.11 allows 
points participating in first stage butterfly computations 
to be input to the same processors when implemented using 
more than one processor. The ‘apaue data is normal- 
ordered, however, and the algorithm produces bit-reversed 
Output. Thus the WHT matrix W is the transpose of the 
W matrix of SeliewER input/normal-ordered output case. 
The stages perform the same computations, but in 
different order. Thus the matrix factorization represented 
by the signal flowgraph of Figure 3.11 represents an 


alternate factorization of "thew whP net. This 
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alternate factorization is shown in Figure 3.12, and 
is the basis for the multi-processor implementation of 
the WHT described in the following paragraphs. 

The factorization of (3.23) shown in Figure 3.12 
allows the multi-processor Walsh ordered transform 
(MPWHT)w to parallel the development of the MPFFT 
presented in Section II. The length of the input sequence 
and the number of active processor are again assumed to 
be powers of 2. There are log.N computation matrix factomes 
each representing a butterfly computation stage. 

(N=input record length.) These computation matrices are 
again divided into L*=logoM pre-transfer stages and 
log.P post-transfer stages, with log.P inter-processor 
transfer stages. (P=number of active processors, 
M=N/P=number of points in each processor.) The 

"maximum transfers" case again occurs when L*=1 and 

each processor contains only 2 data points. For the 
8-point WHT on 4 processors, the vector-matrix equation 


factors as: 


SOS ie W3 T5 Wo zy My x(n) (3.24) 


where the W matrices represent butterfly stages, the T 
matrices describe transfer stages, x(n) is the input 
sequence, and X(k) is the transformed output sequence. 


The factorization of (3.24) is shown in Figure 3.13 and 
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represents the multi-processor implementation of (3.23) 
from Figure 3.12. Note the effect of the multi-processor 
implementation on the ordering of the output sequence. 
The single-processor WHTw produces bit-reversed output 
when the input is normal-ordered (Figure 3.12). The 
transfers in the multi-processor algorithm reorder the 
output to pseudo-bit-reversed (PBR) order. Note also 
that the transfer matrices are the same as the 8-point/ 
4-processor MPFFT with BR input (Figure 2.4). This 
convenient property means that the algorithm of 
Section II.D.4, which transforms normal to pseudo-normal 
ordering, may be used to permute BR (single-processor 
output) into PBR ordering. A slight modification must be 
made to account for BR order as the starting point, 
rather than normal ordering. Whereas the bit positions 
are labelled Dob, bo for the 8-point normal-to-PN 
reordering algorithm, the bit positions must be 
relabelled Dob, bo to correspond to BR-to-PBR reordering. 
This is the same idea as shown in Table 1, and results 
in the appropraite reordering when the "bit-switching" 
algorithm is applied. 
Just as with the MPFFT, the MPWHTw can also 

be implemented using 2 processors. In this case the 
equation is: 

X(k) = Ws 7, Wo W, x) (3.26) 
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and, as in the MPFFT, there are 2 pre-transfer butterfly 
stages, a transfer stage, and 1 post-transfer butterfly 
stage. Transfer matrix Ti is again identical to r1 of 
the 8-point/2-processor DIT MPFFT. The output is again 
in PBR order, though different from the 8-point/4-processor 
PBR order. Since there is only one transfer in the 
2-processor case, its output is less-reordered from the 
single-processor BR ordering. This is analogous to the 
PN ordering of the MPFFT, which was shown to also be a 
function of the number of transfers required. 

The MPWHTw may also be implemented using PBR input, 
to produce normal-ordered output. Using the signal 
flowgraph of Figure 3.11, now with PBR-ordered input, 


the 8-point/4-processor MPWHTw matrix factors as: 
= ’ t 
EN) > Boo ee Ce) 


where es of the normal input algorithm and = h 
of (3824) and Figure 342 The Wak. matrix W is the 
transpose of the product W from Figure 3.13. The 
individual stage factors are different, however. They 
represent the same computations at each stage, but in a 
different order, since the input ordering is changed. 
These ideas are illustrated in Figaee 32:14... © ire 
Situation is slightly more complicated when only 2 


processors, requiring only 1 transfer stage, are used 
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with a PBR-ordered input sequence. The vector-matrix 


equation in this case is: 


K(k) = Wy Wy Zp Wy x(n) (3.29) 
where the output is again produced in normal order. The 


transfer matrix T is identical to Ty of the 8-point/ 
2-processor MPWHTw with normal-ordered input, but occurs 
after only one pre-transfer butterfly stage. The 
product of (3.29) is again the transpose of the W matrix 
of the normal input case, and the same calculations are 
performed at each stage, again in different order. Thus, 
for PBR input, there are log.,P butterfly stages with 
log.,P transfers between each, then logo" post-transfer 
butterfly stages. The Walsh ordered transform is easily 
implementable on the multi-processor Butterfly Network, 
and exhibits many of the same properties as the multi- 
processor FFT. It can be used with normal-ordered input, 
to yield pseudo-bit-reversed output, or vice versa. 

32 Single-Processor Hadamard Ordered Transform (WHT)h 

Walsh functions can be rearranged to form 

Hadamard or natural ordering. The wear aise sampling of 
these Hadamard-ordered Walsh functions yields Hadamard 
matrices, which can be generated Ae te aL Ge [Ref. 2: p. Sige 


13: p. 226 | Vase to itliows': 
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[H,(0)] = [1] 
[H, (k+1) ] = LH, (k) ] LH, (k) | k=0,1,...,log.N 
[Hp (k)] 4H, Ce) J 
C3) 5 SO) 
Meesiadamard ordered transform is similar to (3.20), and 


is defined as: 

X(k) = (1/N)(H, (6) x(n) (3.31a) 
where G=log.N. The inverse (WHT)h is defined as: 

x(n) = [H,(G) ]X(k) Gres ib) 


The Hadamard ordered transform is also referred to as the 
Demers Fourier representation (BEFORE) transform. Fast 
algorithms can also be developed for the (WHT)h based on 
[eer ix factoOrizations. The Hadamard matrix for an 8-point 


transform can be formed by 3 successive applications of 


(3 30): 
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This matrix can be factored into 3(log,N) MatGixreraerors. 
each representing a computation stage of a fast WHTh 
algorithm. One particular factorization, using normal 
input and producing normal output, 1s shown in Fieure oo lo- 
4. Multi-processor Hadamard Ordered Transform (MPWHT)w 
Another method of factoring (3.32) is shown in 
Figure 3.16. Again, this factorization is not unique, 
but allows for efficient implementation of (3.32). The 
corresponding single-processor signal flowgraph is shown in 
Figure 3.1/7. This factorization accepts bit-reversed input, 
and generates bit-reversed output. This method is 
compatible with a multi-processor implementation, in 
that points participating in the first butterfly stage 
are adjacent in the input sequence. It is easy to verify 
that W from Figure 3.16 is equivalent to (35325 7 wasn 
rows and columns reordered to reflect the changes in 
output and input ordering, respectively. 
The algorithm ef (3.34) a@ndMeaecure 3717 can be 
partitioned for implementation on the multi-processor 
Butterfly Network. For an 8-point WHTh implemented using 


4 processors, the vector-matrix equation is: 


XR) = Wao fo ie oreo ee | (3.35) 


where the input sequence, x(n), is in BR ordering. The 


Output is permuted into PBR order by the necessary 
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inter-processor transfers. The matrix product of (3.35) 
is carried out in Figure 3.18. Note that the transfer 
matrices T, and ro are again identical to the 8-point/ 
4-processor DIT MPFFT with BR input. The product W is 
equivalent to the W matrix of the WHTh single-processor 
implementation with rows reordered, reflecting the 
reordering of the output. 

An 8-point WHTh implemented on 2 processors can 


be expressed as: 
T, Wo Ny 2) (322.09) 


where T, is the same as ry of the Walsh-ordered MPWHT, 
again identical to T, of the 8-point/2-processor DIT 
MPFFT with BR input. In this case also, the output is in 
PBR order, as would be expected. Thus, the algorithms 
from Section II for transfer matrix generation and output 
reordering may be used in this case also. 

Another alternative factoeri2zaeuon of the 
Hadamard transform is to use pseudo-normal input data. 
This option is desirable, since the output is produced 
in normal order, as the Walsh ordered transform produces 
normal output when the input sequence is in PBR order. 
The 8-point/4-processor MPWHTw using PN input is 


expressed as: 


X(k) = We T W, T 


f t . 
De ee x(n) (3.37) 
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8-Point/4-Processor Fast WHTh (BR Input) 


Figure 3.18 


129 


and the product is carried out im Pueure 3°19) swede 
to the Walsh ordered transform when PBR input is used, 
the transfers of (3.37) occur in reverse order from 


the BR input case of (3.35) ei eeeecr 


ae of G3237) ais 


iden tac me T> One (3.35 aes areT.' and Ti When fewer 
processors are used, the transfer matrices are also the 
same, and also occur in reverse order. This has the 
effect of reversing the relationship between pre- and 
post-transfer butterfly stages. Now there are log.P 
butterfly and transfer stages, followed by logoM 
post-transfer butterfly stages with no transfers between 
the post-transfer butterflies. 

In this section all normalizing factors of (1/N) 

r 
for the forward transforms have been omitted to allow 
concentration on thetransfer patterns and reordering 
of output sequences. They are easily included at the 
end, if necessary. 

The ideas presented in this section are easily 
extended to larger data lengths and more processors as 
follows. For input record length N, there are always 
logoN total butterfly stages. For P active 
processors, there are log.P butterfly stages with inter- 
processor transfer stages between each. These occur after 


L¥=logoM pre-transfer butterfly stages (BR input), or 


prior to L* butterfly stages (PN input). The transfer 
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8-Point/4-Processor Fast WHTh (PN Input) 


Figure 3.19 
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patterns and output reordering for the Walsh transform 
(Walsh or Hadamard order) parallel those of the multi- 
processor FFT. Thus, algorithms for FFT transfer 
matrix generation are applicable to the corresponding 
WHT transfer patterns. Reordering of output sequences 
from the single-processor WHT are likewise obtainable 
from the corresponding MPFT algorithm. 

Finally, the Walsh transform is very compatible 
with the multi-processor Butterfly Network. Both Walsh 
and Hadamard orderings have many of the same properties as 
the MPFFT when implemented on the BN. Thus, ideas from 


Section II are also applicable to the WHT. 
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iV eo Ee PNED ALGORITHM POR TRANSFER MATRIX 
Geer a LEON AND PHUELE-PROCESSOR OUTPUT 
ORDERING 
The algorithms presented in Section II for generating 
transfer matrices and output reordering can be combined into 
a single algorithm. To do this, a distinction must be 


drawn between column labels and output labels. Recall 


the algorithm of Section II.D.2 for generating transfer 


matrices as permutations of identity matrices. In an identity 


ive on) = OCcGUEeOonlty in those positions where the row and 
column indices are the same. For the chosen transfer 
patterns on the Butterfly Network, processors transfer 
one~half of their data at any given transfer stage. Thus, 
half the elements move off the main diagonal, and half 
remain. The algorithm for generating transfer matrices 
gives the column indices where 1's appear, as a permutation 
of the binary representation of the (normal-ordered) row 
indices. For half the row indices, this involves switching 
a l anda O, and the corresponding element moves off the 
main diagonal (representing transfer of a data point). For 
the other half of the row indices, the algorithm switches a 
l1 and al, or a O and a O, and the corresponding elements’ 
row and column indices remain equal. These elements remain 


on the main diagonal, and represent data points not 
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transferring at that stage. The transfer marices are 
independent, in the sense that each successive transfer 
matrix is generated by permuting the normal-ordered row 
indices to obtain the e~olumns where 2 SWamgeaneat that 
stage. 

The effect of the transfers on the output ordering is 
cumulative, however. Recall that the multi-processor fast 
transform output can be viewed as a reordering of the 
corresponding single-processor output. For example, a 
single-processor DIF FFT with bit-reversed input produces 
normal-ordered output. Section II.C showed that the DIT 
multi-processor FFT output is produced in pseudo-normal 
order. Recall also”that pseudo-normal order is avinnetrom 
of the number of active processors, P, hence the number of 
transfer stages. That is, the pseudo-normal ordering for 
a 16-point/8-processor DIT FFT is different from that of 
a l6-point/4-processor DIT FFT. In the first case, 
3(log.,8) transfers are necessary, while the 4-processor 
case only requires 2 transfer stages. Thus, the pseudo- 
normal ordering of the 8-processor case is permuted 
"farther" in some sense from the single-processor 
normal-ordered output. 

With the above background in ailing. the algorithm for 
generating transfer matrices is extended to also give 
Output reordering as follows. A distinction is drawn 


between column indices and output ordering indices. 
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These are both represented as binary numbers, and then 
Deumuted according to the algorithm of Section.II.D.2. 
Since the column indices are generated from the normal- 
ordered row indices of each transfer stage, normal-order 
is the starting point for generating each transfer matrix. 
The output ordering begins in the appropriate single- 
processor output ordering for the fast transform being 
computed, then is cumulatively permuted by successive 
transfers. 

several examples to illustrate the combined algorithm 
are shown in Table 2. This table illustrates the transfer 
matrix/output reordering of the DIT multi-processor FFT. 
Recall that the single-processor DIT FFT with bit-reversed 
input produces normal-ordered output. Likewise, for 
normal-ordered input, the single-processor DIT FFT generates 
output in bit-reversed order. These orderings are the 
Starting points for the output reorderings shown in 
Table 2. Note that the bit-switching algorithm is applied 
literally to the output indices' bit labels, regardless of 
what position the bit labels occupy at a given stage. Also 
note that due to the cumulative effect of transfers on the 
output ordering, the first transfer of any example is the 
only time the column indices and the output indices are the 
same. It is also obvious from Table 2 that pseudo-normal 
and pseudo-bit-reversed ordering depend on the number of 


transfers necessary. The first transfer stage of the 
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TABLE 2: COMBINED ALGORITHM FOR GENERATING TRANSFER MATRICES 


AND OUTPUT ORDERING (DIT MPFFT) 


16 Points/8 Processors: 


Transfer Matrix 


Row Indices Coiumn Indices 
Pare 1-6 

Ty (bo - by) bg ba by by 

To (bo - bQ) bgbgby bs 

T3 (bo - ba) bo b2bj bg 


16 Points/4 Processors: 


Transfer Matrix 


Row Indices Column Indices 
bg bob, by 

T, (by ~ bo) bab, baby 

T2 (by - ba) by bo b3gbo 


16 Points/2 Processors: 
Transfer Matrix 


Row Indices Column Indices 


b3bg by bo 


bea baba. 


RACE ak 2 seizo 


8 Points/4 Processors: 


Transfer Matrix 


Row Indices Column Indices 


baby bo 
Ti) Kou BS boboby 
T2 (bo - bg) this 


8 Points/2 Processors: 


Transfer Matrix 
Row Indices Column Indices 
PP piace 


Ts (by bo! bo) byboby 


Output Ordering 


Normal: b3 55, by 


Tg (bo - bg) bgbgbob, 
T3 (bo - 53) bobabob, 


PN: bobgbob, 


Output Ordering 
Norman babob, by 


Ty (61 a bo) bab, bobo 
Tz (b1 - bg) bybgbobo 


Output Ordering 
Normal: babo5by bo 


Ty (b2 - bg) bgbgby by 


PN: bobby by 


Output Ordering 


Normal : boby by 


Ty (Bg - by) bobby 
T92 (Bo — ba) Bobo by 


Output Ordering) 
Normal: bob,bo 


0 
PN: b, b5 > 
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Tz (bo - by) by bobybs 

? 
Tg (bg - ba) 61 bgbob3g 
Ty'(bo - b3) by bob3bo 
PBR: by baobab, 


BR: bob, bobs 


Tg (by - bo) bobgb1b3 
Tj (by - b3) bgbobsby 


PB: bob, bab, 


BR: bob bobs 


p 
BR: boby,b5 
Tj (by - by) bybobe 
Ti (bo - bo) 519% 
Br: bob, be 
’ 
Ty (by - by) bobby 
PBR: bobob, 


l6-point/8-processor example is not required when using only 
4 or 2 processors, and is effectively bypassed in the latter 
two cases. It can also be verified, by comparing Table 2 
with Table 1, that the output ordering produced is the same 
when using the combined algorithm. 

Care must be taken in applying this algorithm, depending 
on the type of fast transform being computed. The algorithm 
from Section I[I.D.2 and generalized here, was developed 
for DIT FFT transfer patterns. It can be used, with slight 
modifications to the sequence of bit-switching, for all 
other fast transforms presented herein. The modifications 
are discussed in the sections presenting the various other 


Fast transforms. 


Por, 


V. CONCLUSIONS 


Matrix factorization is a useful method of representing 
distributed fast transforms. While not suggested as a 
method of actually computing the transtorms, this 
representation provides insight into which points combine 
during computation stages, and into the transfer patterns of 
the Butterfly Network. Additional insight is gained in 
terms of the output reordering caused by the inter-processor 
data transfers. 

Matrix partitioning and factorization have been widely 
used to describe single-processor fast transform 
algorithms. This method was shown to also be useful in 
describing distributed fast transforms. Additional matrix 
factors are required to represent the necessary inter- 
processor transfer stages. These "transfer matrices” may 
be generated by permutations of identity matrices for 
selected transfer patterns on the Butterfly Network. The 
transfer structure was shown to determine the ordering of 
the transformed output sequence. The output ordering can 
be described in terms of a permutaton of the single-processor 
output produced by a given fast transform algorithn. 

The Butterfly Network also supports the Fast Hartley 


Transform, and the Walsh-Hadamard Transform. Transfer 
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patterns similar to thoserequired for the multi-processor 


FFT can also be used for these other fast transforms. The 


output orderings produced are unique, but easily decipherable. 


Many equivalent signal flowgraphs can be constructed to 


represent any given fast transform algorithm. This 


flexibility also permits many equivalent matrix factorizations 


to be developed. Constraints include the need to minimize 
additional memory requirements for buffering, or computing 
"in-place" as much as possible, and the desire to produce 
an orderly and decipherable output Sequence. Additionally, 
for distributed algorithms, an orderly pattern of inter- 
processor data transfers is also desirable. The Butterfly 
Network was shown to meet these constraints for the 
computation of distributed FFTs, and to conveniently 
Support several other fast transform algorithms. 

Further work would be desirable on developing a general 


method for finding the matrix factors of the distributed form 


of a given fast transform algorithm. An optimal implementation 


of the FHT (possibly the one given in Section III.A.2) on 
the Butterfly Network is also worthy of further research, 


as the FHT appears to have a quite promising future. 


ESS, 
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